PVT, or Pyramid Vision Transformer, is a vision transformer that uses a pyramid structure to make it an effective backbone for dense prediction tasks. Specifically, it allows more fine-grained inputs (4 x 4 pixels per patch) to be used, while progressively shrinking the sequence length of the Transformer as it deepens, reducing the computational cost. Additionally, a spatial-reduction attention (SRA) layer is used to further reduce the resource consumption when learning high-resolution features.
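To make the SRA saving concrete, the sketch below estimates the matrix-multiply cost of one attention layer with and without spatial reduction of keys and values. The sizes (a 56 x 56 stage-1 feature map from a 224 x 224 input, head dimension 64, reduction ratio 8) follow the PVT paper's stage-1 configuration; the FLOP formula is a standard back-of-the-envelope approximation, not the authors' code.

```python
# Back-of-the-envelope cost of one attention layer, with and without
# spatial-reduction attention (SRA). Sizes match PVT's stage 1.
def attention_flops(n_tokens, dim, reduction=1):
    """Approximate matmul FLOPs of attention: Q @ K^T plus attn @ V.
    With SRA, keys/values are spatially reduced by reduction**2."""
    n_kv = n_tokens // reduction**2
    return 2 * n_tokens * n_kv * dim

h = w = 56                 # 224 x 224 input at stride 4
n = h * w                  # 3136 tokens
plain = attention_flops(n, 64)
sra = attention_flops(n, 64, reduction=8)   # R = 8 at stage 1 in PVT
print(plain // sra)        # -> 64: SRA cuts attention cost 64x here
```

Because both matmuls scale with the key/value length, a reduction ratio of R shrinks attention cost by roughly R squared, which is what makes high-resolution early stages affordable.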
The entire model is divided into four stages, each of which is composed of a patch embedding layer and an $L_{i}$-layer Transformer encoder. Following a pyramid structure, the output resolution of the four stages progressively shrinks from high (4-stride) to low (32-stride).
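The stage geometry above can be sketched directly from the strides. Assuming a standard 224 x 224 input (the strides 4, 8, 16, 32 are from the source; the helper name is illustrative), each stage's feature-map side length and token count are:

```python
# Sketch of PVT's four-stage pyramid: overall strides 4, 8, 16, 32
# relative to the input image.
def stage_shapes(img_size=224):
    """Return (side_length, n_tokens) for each of the four stages."""
    shapes = []
    for stride in (4, 8, 16, 32):
        side = img_size // stride
        shapes.append((side, side * side))
    return shapes

for i, (side, n_tokens) in enumerate(stage_shapes(), 1):
    print(f"stage {i}: {side}x{side} feature map, {n_tokens} tokens")
# stage 1: 56x56 feature map, 3136 tokens
# stage 2: 28x28 feature map, 784 tokens
# stage 3: 14x14 feature map, 196 tokens
# stage 4: 7x7 feature map, 49 tokens
```

The shrinking token count is what lets later, wider stages stay cheap while the early high-resolution stages feed dense prediction heads.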
Source: Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
Task | Papers | Share |
---|---|---|
Object Detection | 6 | 16.67% |
Semantic Segmentation | 5 | 13.89% |
Instance Segmentation | 3 | 8.33% |
Image Classification | 3 | 8.33% |
Self-Supervised Learning | 2 | 5.56% |
Medical Image Segmentation | 1 | 2.78% |
Action Recognition | 1 | 2.78% |
Temporal Action Localization | 1 | 2.78% |
Computational Efficiency | 1 | 2.78% |