DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets

Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks. In ViT, the input image is divided into patch tokens and processed through a stack of self-attention blocks. However, unlike Convolutional Neural Networks (CNNs), ViT's simple architecture has no informative inductive bias (e.g., locality). Due to this, ViT requires a large amount of data for pre-training. Various data-efficient approaches (e.g., DeiT) have been proposed to train ViT effectively on balanced datasets. However, limited literature discusses the use of ViT for datasets with long-tailed imbalances. In this work, we introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets. In DeiT-LT, we introduce an efficient and effective way of distillation from a CNN via the distillation DIST token, using out-of-distribution images and re-weighting the distillation loss to enhance focus on tail classes. This leads to the learning of local CNN-like features in early ViT blocks, improving generalization for tail classes. Further, to mitigate overfitting, we propose distilling from a flat CNN teacher, which leads to learning low-rank generalizable features for DIST tokens across all ViT blocks. With the proposed DeiT-LT scheme, the distillation DIST token becomes an expert on the tail classes, and the classifier CLS token becomes an expert on the head classes. These experts help to effectively learn features corresponding to both the majority and minority classes using a distinct set of tokens within the same ViT architecture. We show the effectiveness of DeiT-LT for training ViT from scratch on datasets ranging from the small-scale CIFAR-10 LT to the large-scale iNaturalist-2018.
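The abstract describes three ingredients: hard distillation from a CNN teacher into a dedicated DIST token, out-of-distribution (strongly augmented) views for the distillation pass, and a distillation loss re-weighted toward tail classes. The sketch below shows, in PyTorch, one plausible form of such a training objective; the two-head ViT interface, the `ood_augment` op, the effective-number class weighting, and all other names are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch (assumptions noted in comments) of a DeiT-LT-style objective.
import torch
import torch.nn.functional as F

def deit_lt_loss(vit, cnn_teacher, images, labels, class_counts, beta=0.9999):
    """CLS head: supervised cross-entropy on ground-truth labels (head expert).
    DIST head: cross-entropy against hard predictions of a (flat, e.g. SAM-trained)
    CNN teacher on out-of-distribution views, re-weighted toward tail classes."""
    # Assumed strong-augmentation / mixing op producing OOD views of the batch
    ood_views = ood_augment(images)

    # Hard teacher labels on the OOD views (teacher is frozen)
    with torch.no_grad():
        teacher_labels = cnn_teacher(ood_views).argmax(dim=1)

    # Assumed student interface: returns (CLS logits, DIST logits)
    cls_logits, _ = vit(images)        # CLS head sees standard views
    _, dist_logits = vit(ood_views)    # DIST head sees the OOD views

    # Standard classification loss on the CLS head
    loss_cls = F.cross_entropy(cls_logits, labels)

    # Effective-number class weights (one common re-weighting choice)
    # to emphasise minority classes in the distillation loss
    eff_num = 1.0 - torch.pow(beta, class_counts.float())
    weights = (1.0 - beta) / eff_num
    weights = weights / weights.sum() * len(class_counts)

    loss_dist = F.cross_entropy(dist_logits, teacher_labels, weight=weights)

    # Equal mixture of the two objectives, as in DeiT-style hard distillation
    return 0.5 * loss_cls + 0.5 * loss_dist
```

At inference, predictions from the CLS (head-expert) and DIST (tail-expert) heads can be averaged, mirroring the standard DeiT practice of combining the classifier and distillation tokens.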


Results from the Paper


Task                  Dataset               Model           Metric Name     Metric Value  Global Rank
Long-tail Learning    CIFAR-100-LT (ρ=100)  DeiT-LT         Error Rate      44.4          #11
Long-tail Learning    CIFAR-100-LT (ρ=50)   DeiT-LT         Error Rate      39.5          #8
Long-tail Learning    CIFAR-10-LT (ρ=100)   DeiT-LT         Error Rate      12.5          #5
Long-tail Learning    CIFAR-10-LT (ρ=50)    DeiT-LT         Error Rate      10.2          #3
Long-tail Learning    ImageNet-LT           DeiT-LT         Top-1 Accuracy  59.1          #15
Image Classification  iNaturalist           DeiT-LT (ours)  Overall         75.1          #1
Long-tail Learning    iNaturalist 2018      DeiT-LT         Top-1 Accuracy  75.1%         #13

Methods