TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Image Classification	ImageNet	PiT-B	Top 1 Accuracy	84%	# 336
Image Classification	ImageNet	PiT-B	Number of params	73.8M	# 797
Image Classification	ImageNet	PiT-B	GFLOPs	12.5	# 318
Image Classification	ImageNet	PiT-S	Top 1 Accuracy	81.9%	# 543
Image Classification	ImageNet	PiT-S	Number of params	23.5M	# 578
Image Classification	ImageNet	PiT-S	GFLOPs	2.9	# 171
Image Classification	ImageNet	PiT-XS	Top 1 Accuracy	79.1%	# 714
Image Classification	ImageNet	PiT-XS	Number of params	10.6M	# 480
Image Classification	ImageNet	PiT-XS	GFLOPs	1.4	# 128
Image Classification	ImageNet	PiT-Ti	Top 1 Accuracy	74.6%	# 902
Image Classification	ImageNet	PiT-Ti	Number of params	4.9M	# 402
Image Classification	ImageNet	PiT-Ti	GFLOPs	0.7	# 83
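
The table above is in long format, with one row per model and metric. As a reading aid, the minimal sketch below regroups it into one record per model. The values are copied from the table; the names `ROWS` and `pivot` are hypothetical helpers introduced only for this illustration.

```python
# Long-format rows copied from the table above: (model, metric, value).
ROWS = [
    ("PiT-B", "Top 1 Accuracy", "84%"),
    ("PiT-B", "Number of params", "73.8M"),
    ("PiT-B", "GFLOPs", "12.5"),
    ("PiT-S", "Top 1 Accuracy", "81.9%"),
    ("PiT-S", "Number of params", "23.5M"),
    ("PiT-S", "GFLOPs", "2.9"),
    ("PiT-XS", "Top 1 Accuracy", "79.1%"),
    ("PiT-XS", "Number of params", "10.6M"),
    ("PiT-XS", "GFLOPs", "1.4"),
    ("PiT-Ti", "Top 1 Accuracy", "74.6%"),
    ("PiT-Ti", "Number of params", "4.9M"),
    ("PiT-Ti", "GFLOPs", "0.7"),
]


def pivot(rows):
    """Group (model, metric, value) rows into {model: {metric: value}}."""
    table = {}
    for model, metric, value in rows:
        # Create the per-model record on first sight, then fill in the metric.
        table.setdefault(model, {})[metric] = value
    return table


if __name__ == "__main__":
    for model, metrics in pivot(ROWS).items():
        print(model, metrics["Top 1 Accuracy"],
              metrics["Number of params"], metrics["GFLOPs"])
```

Values are kept as strings so that the units shown on the page (%, M) are preserved exactly as listed.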

Rethinking Spatial Dimensions of Vision Transformers

ICCV 2021 · Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh

Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks, offering an alternative architecture to existing convolutional neural networks (CNNs). Because transformer-based architectures are still new to computer vision modeling, design conventions for an effective architecture have been studied less. Drawing on the successful design principles of CNNs, we investigate the role of spatial dimension conversion and its effectiveness in transformer-based architectures. We particularly attend to the dimension reduction principle of CNNs: as depth increases, a conventional CNN increases the channel dimension and decreases the spatial dimensions. We empirically show that such spatial dimension reduction is beneficial to a transformer architecture as well, and propose a novel Pooling-based Vision Transformer (PiT) built upon the original ViT model. We show that PiT achieves improved model capability and generalization performance compared with ViT. Through extensive experiments, we further show that PiT outperforms the baseline on several tasks such as image classification, object detection, and robustness evaluation. Source code and ImageNet models are available at https://github.com/naver-ai/pit
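
The dimension reduction principle described in the abstract can be illustrated with simple shape arithmetic. The sketch below is not the authors' implementation: it assumes, purely for illustration, that each pooling step halves the spatial side (rounding up) and doubles the channel dimension, and the function name and the example configuration (224-pixel input, patch size 16, stride 8, 64 base channels, 3 stages) are hypothetical.

```python
def pooled_stage_shapes(image_size, patch_size, stride, base_channels, num_stages):
    """Return a list of (side, tokens, channels) tuples, one per stage.

    Assumption (not taken from the paper): every pooling step halves the
    spatial side, rounding up, and doubles the channel dimension.
    """
    # Number of patch positions along one side for an unpadded strided embedding.
    side = (image_size - patch_size) // stride + 1
    channels = base_channels
    shapes = []
    for _ in range(num_stages):
        shapes.append((side, side * side, channels))
        side = (side + 1) // 2   # spatial dimensions shrink with depth
        channels *= 2            # channel dimension grows with depth
    return shapes


if __name__ == "__main__":
    for side, tokens, channels in pooled_stage_shapes(224, 16, 8, 64, 3):
        print(f"{side}x{side} grid, {tokens} tokens, {channels} channels")
```

Under these assumptions the token count falls from 729 to 196 to 49 across the three stages while the channel dimension grows from 64 to 256. A ViT-style model without pooling would keep both values fixed at every depth.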

Code

naver-ai/pit (official): 236 stars

rwightman/pytorch-image-models: 29,680 stars

BR-IDL/PaddleViT: 1,183 stars

martinsbruveris/tensorflow-image-mo…: 279 stars

naver-ai/pflayer

10 implementations in total

Tasks

Dimensionality Reduction

Image Classification

Object Detection

Datasets

ImageNet

MS COCO

ImageNet-A

Results from the Paper

Ranked #333 on Image Classification on ImageNet


Methods

Absolute Position Encodings • Adam • BPE • Dense Connections • Dropout • Label Smoothing • Layer Normalization • Linear Layer • Multi-Head Attention • Position-Wise Feed-Forward Layer • Residual Connection • Scaled Dot-Product Attention • Softmax • Transformer • Vision Transformer
