Efficient ViTs

26 papers with code • 3 benchmarks • 0 datasets

Increasing the efficiency of ViTs without modifying the architecture (e.g., key & query sparsification, token pruning & merging).

Most implemented papers

Training data-efficient image transformers & distillation through attention

facebookresearch/deit 23 Dec 2020

In this work, we produce a competitive convolution-free transformer by training on ImageNet only.

All Tokens Matter: Token Labeling for Training Better Vision Transformers

zihangJiang/TokenLabeling NeurIPS 2021

In this paper, we present token labeling -- a new training objective for training high-performance vision transformers (ViTs).
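
A minimal sketch of what a token-labeling style objective can look like, assuming per-patch logits and precomputed soft token labels; the helper names below are illustrative, not the zihangJiang/TokenLabeling API:

```python
# Sketch of a token-labeling objective: class-token loss plus a dense,
# per-patch soft cross-entropy term (hypothetical helper, not the repo's API).
import torch
import torch.nn.functional as F

def token_labeling_loss(cls_logits, patch_logits, cls_target, token_labels, beta=0.5):
    # Standard classification loss on the class token: [B, C] vs. [B].
    cls_loss = F.cross_entropy(cls_logits, cls_target)
    # Dense token loss: each patch prediction [B, N, C] is matched against its
    # location-specific soft label (e.g. from a machine annotator), then averaged.
    log_probs = F.log_softmax(patch_logits, dim=-1)
    tok_loss = -(token_labels * log_probs).sum(dim=-1).mean()
    return cls_loss + beta * tok_loss
```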

Fast Vision Transformers with HiLo Attention

ziplab/litv2 26 May 2022

Therefore, we propose to disentangle the high- and low-frequency patterns in an attention layer by separating the heads into two groups: one group encodes high frequencies via self-attention within each local window, while the other encodes low frequencies by performing global attention between the average-pooled low-frequency keys and values from each window and each query position in the input feature map.
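
A rough PyTorch sketch of that head split, assuming a [B, H*W, C] token layout, a toy 2x2 window, and an even high/low head split; this is illustrative only, not the ziplab/litv2 implementation:

```python
# HiLo-style attention sketch: high-frequency heads attend within local windows,
# low-frequency heads attend globally to average-pooled window keys/values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiLoAttentionSketch(nn.Module):
    def __init__(self, dim, num_heads=8, alpha=0.5, window=2):
        super().__init__()
        self.head_dim = dim // num_heads
        self.l_heads = int(num_heads * alpha)          # low-frequency heads
        self.h_heads = num_heads - self.l_heads        # high-frequency heads
        self.window = window
        self.h_qkv = nn.Linear(dim, 3 * self.h_heads * self.head_dim)
        self.l_q = nn.Linear(dim, self.l_heads * self.head_dim)
        self.l_kv = nn.Linear(dim, 2 * self.l_heads * self.head_dim)
        self.proj = nn.Linear((self.h_heads + self.l_heads) * self.head_dim, dim)

    def forward(self, x, H, W):                        # x: [B, H*W, C]
        B, N, C = x.shape
        w = self.window
        # Hi-Fi branch: self-attention inside each non-overlapping w x w window.
        qkv = self.h_qkv(x).reshape(B, H // w, w, W // w, w, 3, self.h_heads, self.head_dim)
        qkv = qkv.permute(5, 0, 1, 3, 6, 2, 4, 7).reshape(3, -1, self.h_heads, w * w, self.head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]
        hi = F.scaled_dot_product_attention(q, k, v)
        hi = hi.reshape(B, H // w, W // w, self.h_heads, w, w, self.head_dim)
        hi = hi.permute(0, 1, 4, 2, 5, 3, 6).reshape(B, N, -1)
        # Lo-Fi branch: queries at every position, keys/values from pooled windows.
        pooled = F.avg_pool2d(x.transpose(1, 2).reshape(B, C, H, W), w)
        pooled = pooled.flatten(2).transpose(1, 2)     # [B, (H/w)*(W/w), C]
        q = self.l_q(x).reshape(B, N, self.l_heads, self.head_dim).transpose(1, 2)
        kv = self.l_kv(pooled).reshape(B, -1, 2, self.l_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        lo = F.scaled_dot_product_attention(q, kv[0], kv[1]).transpose(1, 2).reshape(B, N, -1)
        return self.proj(torch.cat([hi, lo], dim=-1))
```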

Pruning Self-attentions into Convolutional Layers in Single Path

zhuang-group/spvit 23 Nov 2021

Relying on the single-path space, we introduce learnable binary gates to encode the operation choices in MSA layers.
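
As a sketch of how a learnable binary gate with a straight-through estimator can encode such an operation choice, here is a minimal gated module; the module names are placeholders, not the zhuang-group/spvit API:

```python
# Learnable binary gate (straight-through) choosing between an attention op
# and a cheaper convolution-like op inside a block.
import torch
import torch.nn as nn

class BinaryGate(nn.Module):
    def __init__(self):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(1))  # learnable gate parameter

    def forward(self):
        prob = torch.sigmoid(self.logit)
        hard = (prob > 0.5).float()
        # Forward uses the hard 0/1 decision; gradients flow through the soft prob.
        return hard + prob - prob.detach()

class GatedChoice(nn.Module):
    """Selects between a self-attention op and a convolutional op via the gate."""
    def __init__(self, attn_op, conv_op):
        super().__init__()
        self.gate = BinaryGate()
        self.attn_op, self.conv_op = attn_op, conv_op

    def forward(self, x):
        g = self.gate()
        return g * self.attn_op(x) + (1.0 - g) * self.conv_op(x)
```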

Token Merging: Your ViT But Faster

facebookresearch/tome 17 Oct 2022

Off-the-shelf, ToMe can 2x the throughput of state-of-the-art ViT-L @ 512 and ViT-H @ 518 models on images and 2.2x the throughput of ViT-L on video with only a 0.2-0.3% accuracy drop in each case.
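
A compact sketch of similarity-based merging between alternating token sets, in the spirit of ToMe's bipartite soft matching; the code is illustrative only and omits class-token protection and size-weighted averaging, so it is not the facebookresearch/tome implementation:

```python
# Merge the r most similar tokens per image by matching alternating token sets.
import torch
import torch.nn.functional as F

def merge_tokens(x, r):
    """x: [B, N, C] tokens; returns [B, N - r, C] after merging r tokens."""
    a, b = x[:, ::2], x[:, 1::2]                       # alternate tokens -> two sets
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-1, -2)  # [B, Na, Nb]
    node_max, node_idx = sim.max(dim=-1)               # best partner in B for each A token
    order = node_max.argsort(dim=-1, descending=True)  # most similar A tokens merge first
    merged_idx, kept_idx = order[:, :r], order[:, r:]
    C = x.shape[-1]
    a_kept = a.gather(1, kept_idx.unsqueeze(-1).expand(-1, -1, C))
    a_merged = a.gather(1, merged_idx.unsqueeze(-1).expand(-1, -1, C))
    dst = node_idx.gather(1, merged_idx).unsqueeze(-1).expand(-1, -1, C)
    # Fold each merged token from set A into its most similar token in B by averaging.
    b = b.scatter_reduce(1, dst, a_merged, reduce="mean", include_self=True)
    return torch.cat([a_kept, b], dim=1)
```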

Scalable Vision Transformers with Hierarchical Pooling

MonashAI/HVT ICCV 2021

However, current ViT models maintain a full-length patch sequence during inference, which is redundant and lacks hierarchical representation.
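
A minimal sketch of shortening the token sequence between stages with 1D max pooling, in the spirit of hierarchical pooling; the stage/block structure below is a placeholder, not the MonashAI/HVT code:

```python
# Progressively shorten the patch sequence instead of keeping it full-length.
import torch
import torch.nn as nn

class PooledStage(nn.Module):
    def __init__(self, dim, depth, num_heads, pool_stride=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
            for _ in range(depth)
        )
        # Downsample the token dimension after this stage.
        self.pool = nn.MaxPool1d(kernel_size=pool_stride, stride=pool_stride)

    def forward(self, x):                  # x: [B, N, C]
        for blk in self.blocks:
            x = blk(x)
        return self.pool(x.transpose(1, 2)).transpose(1, 2)  # [B, N // stride, C]
```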

PPT: Token Pruning and Pooling for Efficient Vision Transformers

mindspore-lab/models 3 Oct 2023

Vision Transformers (ViTs) have emerged as powerful models in the field of computer vision, delivering superior performance across various vision tasks.

DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

raoyongming/DynamicViT NeurIPS 2021

Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input.
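
A bare-bones, inference-time sketch of input-dependent token pruning with a lightweight scorer; the scorer and keep ratio here are illustrative, and the actual raoyongming/DynamicViT training pipeline additionally uses Gumbel-Softmax sampling and attention masking to keep pruning differentiable:

```python
# Keep only the top-scoring patch tokens, conditioned on the input.
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    def __init__(self, dim, keep_ratio=0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, x):                          # x: [B, 1 + N, C], CLS token first
        cls_tok, patches = x[:, :1], x[:, 1:]
        scores = self.scorer(patches).squeeze(-1)  # [B, N] keep score per token
        k = max(1, int(patches.shape[1] * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices        # indices of tokens to keep
        idx = idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1])
        kept = patches.gather(1, idx)              # prune the rest
        return torch.cat([cls_tok, kept], dim=1)   # [B, 1 + k, C]
```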

Chasing Sparsity in Vision Transformers: An End-to-End Exploration

VITA-Group/SViTE NeurIPS 2021

For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data, architecture) improves top-1 accuracy by 0.28%, while enjoying 49.32% FLOPs and 4.40% running-time savings.

Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

YifanXu74/Evo-ViT 3 Aug 2021

Vision transformers (ViTs) have recently received explosive popularity, but the huge computational cost is still a severe issue.