TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Semantic Segmentation	ADE20K val	DeiT-L	mIoU	55.6	# 26
Semantic Segmentation	ADE20K val	DeiT-B	mIoU	54.1	# 33
Image Classification	ImageNet	ViT-S @224 (DeiT III, 21k)	Top 1 Accuracy	83.1%	# 427
Image Classification	ImageNet	ViT-S @384 (DeiT III)	Top 1 Accuracy	83.4%	# 395
Image Classification	ImageNet	ViT-S @384 (DeiT III)	Number of params	22M	# 558
Image Classification	ImageNet	ViT-S @384 (DeiT III)	GFLOPs	15.5	# 341
Image Classification	ImageNet	ViT-L @224 (DeiT III)	Top 1 Accuracy	84.9%	# 266
Image Classification	ImageNet	ViT-B @224 (DeiT III, 21k)	Top 1 Accuracy	85.7%	# 201
Image Classification	ImageNet	ViT-B @384 (DeiT III, 21k)	Top 1 Accuracy	86.7%	# 127
Image Classification	ImageNet	ViT-H @224 (DeiT III)	Top 1 Accuracy	85.2%	# 240
Image Classification	ImageNet	ViT-B @384 (DeiT III)	Top 1 Accuracy	85.0%	# 256
Image Classification	ImageNet	ViT-B @384 (DeiT III)	Number of params	87M	# 823
Image Classification	ImageNet	ViT-B @224 (DeiT III)	Top 1 Accuracy	83.8%	# 359
Image Classification	ImageNet	ViT-L	Top 1 Accuracy	85.8%	# 188
Image Classification	ImageNet	ViT-L	Number of params	304.8M	# 915
Image Classification	ImageNet	ViT-L	GFLOPs	191.2	# 468
Image Classification	ImageNet	ViT-S @224 (DeiT III)	Top 1 Accuracy	81.4%	# 587
Image Classification	ImageNet ReaL	ViT-H @224 (DeiT III, 21k)	Top 1 Accuracy	87.2%	# 3
Image Classification	ImageNet ReaL	ViT-H @224 (DeiT III, 21k)	Number of params	632M	# 1
Image Classification	ImageNet ReaL	ViT-L @224 (DeiT III, 21k)	Top 1 Accuracy	87.0%	# 4
Image Classification	ImageNet ReaL	ViT-L @384 (DeiT III, 21k)	Top 1 Accuracy	87.7%	# 1
Image Classification	ImageNet ReaL	ViT-L @384 (DeiT III, 21k)	Number of params	304M	# 2
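
As a reading aid, a few of the ImageNet Top 1 rows above can be compared directly. The sketch below copies seven values from the table by hand and computes the change in accuracy from a higher test resolution and from ImageNet-21k pre-training. It assumes that "21k" in a model name marks ImageNet-21k pre-training; the `top1` dictionary layout and the `gain` helper are illustrative names, not part of any leaderboard tooling.

```python
# Top 1 accuracy (%) on ImageNet, copied by hand from the table above.
# Keys are (model, resolution, pretrained_on_21k); this layout is an
# illustrative choice, not a format used by the leaderboard.
top1 = {
    ("ViT-S", 224, False): 81.4,
    ("ViT-S", 384, False): 83.4,
    ("ViT-S", 224, True): 83.1,
    ("ViT-B", 224, False): 83.8,
    ("ViT-B", 384, False): 85.0,
    ("ViT-B", 224, True): 85.7,
    ("ViT-B", 384, True): 86.7,
}


def gain(before, after):
    """Return the Top 1 difference in percentage points, rounded to 0.1."""
    return round(top1[after] - top1[before], 1)


for model in ("ViT-S", "ViT-B"):
    base = (model, 224, False)
    # Effect of evaluating at 384 instead of 224, without 21k pre-training.
    print(model, "resolution gain:", gain(base, (model, 384, False)))
    # Effect of ImageNet-21k pre-training at a fixed 224 resolution.
    print(model, "21k pre-training gain:", gain(base, (model, 224, True)))
```

These differences only summarize the rows listed on this page; they are not an ablation from the paper.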

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/deit-iii-revenge-of-the-vit/image-classification-on-imagenet-real)](https://paperswithcode.com/sota/image-classification-on-imagenet-real?p=deit-iii-revenge-of-the-vit)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/deit-iii-revenge-of-the-vit/semantic-segmentation-on-ade20k-val)](https://paperswithcode.com/sota/semantic-segmentation-on-ade20k-val?p=deit-iii-revenge-of-the-vit)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/deit-iii-revenge-of-the-vit/image-classification-on-imagenet)](https://paperswithcode.com/sota/image-classification-on-imagenet?p=deit-iii-revenge-of-the-vit)`
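
The three badges above share one URL pattern, so they can be generated rather than copied. The following is a minimal sketch that assumes the pattern holds for other benchmarks as well; it is inferred from the badges on this page, not taken from a documented API, and `badge_markdown` is a hypothetical helper name.

```python
def badge_markdown(paper_slug: str, benchmark_slug: str) -> str:
    """Build the badge markdown for one paper/benchmark pair.

    The URL layout mirrors the three badges listed above; it is an
    inferred pattern, not a documented interface.
    """
    badge_url = (
        "https://img.shields.io/endpoint.svg?url="
        f"https://paperswithcode.com/badge/{paper_slug}/{benchmark_slug}"
    )
    link_url = f"https://paperswithcode.com/sota/{benchmark_slug}?p={paper_slug}"
    return f"[![PWC]({badge_url})]({link_url})"


paper = "deit-iii-revenge-of-the-vit"
for benchmark in (
    "image-classification-on-imagenet-real",
    "semantic-segmentation-on-ade20k-val",
    "image-classification-on-imagenet",
):
    print(badge_markdown(paper, benchmark))
```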

DeiT III: Revenge of the ViT

14 Apr 2022 · Hugo Touvron, Matthieu Cord, Hervé Jégou

A Vision Transformer (ViT) is a simple neural architecture that can serve several computer vision tasks. It has limited built-in architectural priors, in contrast to more recent architectures that incorporate priors either about the input data or about specific tasks. Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BEiT. In this paper, we revisit the supervised training of ViTs. Our procedure builds upon and simplifies a recipe introduced for training ResNet-50. It includes a new simple data-augmentation procedure with only 3 augmentations, closer to the practice in self-supervised learning. Our evaluations on image classification (ImageNet-1k with and without pre-training on ImageNet-21k), transfer learning, and semantic segmentation show that our procedure outperforms previous fully supervised training recipes for ViT by a large margin. It also reveals that the performance of our ViT trained with supervision is comparable to that of more recent architectures. Our results could serve as better baselines for recent self-supervised approaches demonstrated on ViT.
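
The abstract does not name the three augmentations. To our understanding of the paper, 3-Augment picks one of grayscale, solarization, or Gaussian blur uniformly at random for each image, and combines it with color jitter and a horizontal flip. The sketch below illustrates that selection step in plain Python on a tiny image stored as rows of `(r, g, b)` tuples. It is not the authors' implementation: the function names are hypothetical, a 3x3 mean filter stands in for Gaussian blur, and color jitter is omitted for brevity.

```python
import random


def grayscale(img):
    """Replace each pixel by its luma, replicated over the three channels."""
    out = []
    for row in img:
        new_row = []
        for r, g, b in row:
            y = round(0.299 * r + 0.587 * g + 0.114 * b)
            new_row.append((y, y, y))
        out.append(new_row)
    return out


def solarize(img, threshold=128):
    """Invert every channel value at or above the threshold."""
    return [
        [tuple(255 - c if c >= threshold else c for c in px) for px in row]
        for row in img
    ]


def box_blur(img):
    """3x3 mean filter with edge clamping (a stand-in for Gaussian blur)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = [0, 0, 0]
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    for c in range(3):
                        acc[c] += img[yy][xx][c]
            row.append(tuple(v // 9 for v in acc))
        out.append(row)
    return out


def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def three_augment(img, rng):
    """Apply exactly one of the three operations, then flip with p = 0.5."""
    op = rng.choice([grayscale, solarize, box_blur])
    img = op(img)
    if rng.random() < 0.5:
        img = hflip(img)
    return img


image = [[(200, 100, 50), (10, 20, 30)], [(0, 0, 0), (255, 255, 255)]]
augmented = three_augment(image, random.Random(0))
print(augmented)
```

In practice the same selection would be expressed with an image library's transforms; the plain-Python version only makes the "one of three, chosen uniformly" structure explicit.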

Code

Repository	GitHub stars
facebookresearch/deit (official)	3,894
rwightman/pytorch-image-models	30,029
open-mmlab/mmclassification	3,199
alibaba/EasyCV	1,695
affjljoo3581/deit3-jax	

See all 10 implementations

Tasks

Data Augmentation

Image Classification

Self-Supervised Learning

Semantic Segmentation

Transfer Learning

Datasets

CIFAR-10

ImageNet

CIFAR-100

Oxford 102 Flower

ADE20K

ImageNet-1K

Results from the Paper

Ranked #1 on Image Classification on ImageNet ReaL (Number of params metric)

Methods

3-Augment • Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • FixRes • Label Smoothing • Layer Normalization • LayerScale • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Vision Transformer
