Swin Transformer

Introduced by Liu et al. in Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

The Swin Transformer is a type of Vision Transformer. It builds hierarchical feature maps by merging image patches in deeper layers, and its computational complexity is linear in input image size because self-attention is computed only within each local window. It can thus serve as a general-purpose backbone for both image classification and dense recognition tasks. In contrast, previous vision Transformers produce feature maps of a single low resolution and have computational complexity quadratic in input image size because self-attention is computed globally.
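The two claims above (hierarchical feature maps and linear versus quadratic cost) can be illustrated with a small sketch. It assumes the cost estimates given in the Swin Transformer paper, 4hwC² + 2(hw)²C for global self-attention and 4hwC² + 2M²hwC for window attention with M x M windows (softmax cost omitted), and uses the paper's Swin-T defaults (224 x 224 input, 4 x 4 patches, C = 96, M = 7), none of which appear on this page. The function names are hypothetical and chosen only for illustration; this is not the reference implementation.

```python
def msa_flops(h, w, c):
    """Estimated cost of global self-attention on an h x w token grid with c channels."""
    n = h * w
    # Projections are linear in n; the attention matrix is quadratic in n.
    return 4 * n * c**2 + 2 * n**2 * c


def wmsa_flops(h, w, c, m=7):
    """Estimated cost of self-attention restricted to non-overlapping m x m windows."""
    n = h * w
    # Each token attends to only m*m tokens, so every term is linear in n.
    return 4 * n * c**2 + 2 * m**2 * n * c


def stage_shapes(image_size=224, patch_size=4, c=96, num_stages=4):
    """Return (height, width, channels) of the token grid at each stage.

    Assumes each patch-merging step halves the grid side and doubles the channels.
    """
    side = image_size // patch_size
    shapes = []
    for _ in range(num_stages):
        shapes.append((side, side, c))
        side //= 2
        c *= 2
    return shapes


if __name__ == "__main__":
    # Hierarchical feature maps: resolution shrinks while channel width grows.
    for h, w, c in stage_shapes():
        print(f"stage grid {h}x{w}, channels {c}")

    # Doubling the grid side quadruples the token count.
    for side in (56, 112, 224):
        ratio = msa_flops(side, side, 96) / wmsa_flops(side, side, 96)
        print(f"grid {side}x{side}: global / windowed cost ratio = {ratio:.1f}")
```

Under these assumptions, quadrupling the number of tokens quadruples the windowed cost, while the global cost grows faster because of its (hw)² term, which is why the ratio printed in the second loop keeps rising with resolution.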

Source: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Tasks

Task	Papers	Share
Semantic Segmentation	75	12.50%
Image Classification	45	7.50%
Object Detection	43	7.17%
Instance Segmentation	22	3.67%
Image Segmentation	21	3.50%
Medical Image Segmentation	19	3.17%
Super-Resolution	18	3.00%
Classification	13	2.17%
Image Super-Resolution	11	1.83%

Components

Component	Type
Dense Connections	Feedforward Networks
Layer Normalization	Normalization
Multi-Head Attention	Attention Modules
Residual Connection	Skip Connections
Scaled Dot-Product Attention	Attention Mechanisms
Stochastic Depth	Regularization

Categories

Vision Transformers

Image Models