LocalViT introduces depth-wise convolutions to enhance the local feature modeling capability of ViTs. The network, as shown in Figure (c), brings a locality mechanism into transformers through the depth-wise convolution (denoted by "DW"). To accommodate the convolution operation, conversion between the token sequence and the image feature map is added via "Seq2Img" and "Img2Seq". The computation is as follows:
$$ \mathbf{Y}^{r}=f\left(f\left(\mathbf{Z}^{r} \circledast \mathbf{W}_{1}^{r} \right) \circledast \mathbf{W}_d \right) \circledast \mathbf{W}_2^{r} $$
where $\mathbf{W}_{d} \in \mathbb{R}^{\gamma d \times 1 \times k \times k}$ is the kernel of the depth-wise convolution, $k$ is the kernel size, and $\gamma$ is the channel expansion ratio of the feed-forward layer.
The input (a sequence of tokens) is first reshaped into a feature map arranged on a 2D lattice. Two pointwise convolutions ($\mathbf{W}_1^r$ and $\mathbf{W}_2^r$), with a depth-wise convolution between them, are applied to the feature map. The result is reshaped back into a sequence of tokens, which is then consumed by the self-attention of the next transformer layer.
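The computation above can be sketched as a minimal PyTorch module. This is an illustrative reconstruction from the formula, not the official LocalViT code: the class and argument names (`LocalityFeedForward`, `gamma`, `k`) are assumptions, the activation is taken to be GELU, and the class token is omitted so that the token count is a perfect square.

```python
import torch
import torch.nn as nn

class LocalityFeedForward(nn.Module):
    """Sketch of the locality-enhanced feed-forward block:
    Seq2Img -> 1x1 conv (W1) -> depth-wise conv (Wd) -> 1x1 conv (W2) -> Img2Seq.
    Names and defaults are illustrative, not from the official implementation."""

    def __init__(self, d, gamma=4, k=3):
        super().__init__()
        h = gamma * d  # expanded hidden dimension (gamma * d channels)
        self.conv1 = nn.Conv2d(d, h, kernel_size=1)                      # W1: pointwise expansion
        self.dw = nn.Conv2d(h, h, kernel_size=k, padding=k // 2,
                            groups=h)                                    # Wd: depth-wise k x k
        self.conv2 = nn.Conv2d(h, d, kernel_size=1)                      # W2: pointwise projection
        self.act = nn.GELU()                                             # non-linearity f

    def forward(self, z):
        # z: (B, N, d) token sequence with N = H * W (class token omitted)
        b, n, d = z.shape
        hw = int(n ** 0.5)
        x = z.transpose(1, 2).reshape(b, d, hw, hw)    # Seq2Img: tokens -> 2D feature map
        x = self.act(self.conv1(x))                    # f(Z * W1)
        x = self.act(self.dw(x))                       # f(... * Wd)
        x = self.conv2(x)                              # ... * W2 (no activation, per the formula)
        return x.flatten(2).transpose(1, 2)            # Img2Seq: back to (B, N, d)

tokens = torch.randn(2, 196, 64)                       # e.g. 14x14 patches, d = 64
out = LocalityFeedForward(64)(tokens)
print(out.shape)                                       # torch.Size([2, 196, 64])
```

Note that the activation follows $\mathbf{W}_1^r$ and $\mathbf{W}_d$ but not $\mathbf{W}_2^r$, matching the nesting of $f(\cdot)$ in the equation, and `groups=h` is what makes the middle convolution depth-wise (one filter per channel).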
Source: LocalViT: Bringing Locality to Vision Transformers