ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

5 Feb 2021 · Wonjae Kim, Bokyung Son, Ildoo Kim

Vision-and-Language Pre-training (VLP) has improved performance on various joint vision-and-language downstream tasks. Current approaches to VLP heavily rely on image feature extraction processes, most of which involve region supervision (e.g., object detection) and convolutional architectures (e.g., ResNet). Although largely disregarded in the literature, we find this problematic in terms of both (1) efficiency/speed, in that simply extracting input features requires much more computation than the multimodal interaction steps; and (2) expressive power, which is upper-bounded by the expressive power of the visual embedder and its predefined visual vocabulary. In this paper, we present a minimal VLP model, Vision-and-Language Transformer (ViLT), monolithic in the sense that the processing of visual inputs is drastically simplified to the same convolution-free manner in which we process textual inputs. We show that ViLT is up to tens of times faster than previous VLP models, yet with competitive or better downstream task performance. Our code and pre-trained weights are available at https://github.com/dandelin/vilt.
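To make the single-stream design concrete, below is a minimal PyTorch sketch of the idea described in the abstract: image patches are linearly projected (no CNN, no region detector), given modality-type and positional embeddings, concatenated with text token embeddings, and fed through one shared transformer. The class name SimpleViLT and all hyperparameters are illustrative assumptions loosely matching a Base/32 configuration; this is not the authors' released implementation, and the [CLS] token, pooler, and pre-training heads (image-text matching, masked language modeling) are omitted for brevity.

import torch
import torch.nn as nn

class SimpleViLT(nn.Module):
    """Hypothetical single-stream sketch: text tokens and image patches share one transformer."""
    def __init__(self, vocab_size=30522, dim=768, depth=12, heads=12,
                 patch=32, img_size=384, max_text_len=40):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # Text side: word embeddings plus learned positional embeddings (BERT-style).
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.text_pos = nn.Parameter(torch.zeros(1, max_text_len, dim))
        # Image side: linear projection of flattened patches -- no convolution, no region features.
        self.patch_proj = nn.Linear(3 * patch * patch, dim)
        self.img_pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        # Modality-type embeddings distinguish the two segments after concatenation.
        self.type_emb = nn.Embedding(2, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.patch = patch

    def forward(self, text_ids, image):
        b = image.size(0)
        # (B, 3, H, W) -> (B, N, 3*P*P): flatten non-overlapping patches.
        patches = image.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, 3 * self.patch ** 2)
        img_tok = self.patch_proj(patches) + self.img_pos + self.type_emb.weight[1]
        txt_tok = self.word_emb(text_ids) + self.text_pos[:, :text_ids.size(1)] \
                  + self.type_emb.weight[0]
        # Single-stream fusion: one transformer over the concatenated sequence.
        return self.encoder(torch.cat([txt_tok, img_tok], dim=1))

The point of the sketch is that all multimodal interaction happens inside the shared encoder; the only image-specific computation is the patch projection, which is why feature extraction is so much cheaper than in detector-based VLP pipelines.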


Results from the Paper


Task | Dataset | Model | Metric | Value | Global Rank
Zero-Shot Cross-Modal Retrieval | COCO 2014 | ViLT-B/32 | Image-to-text R@1 | 56.5 | # 14
Zero-Shot Cross-Modal Retrieval | COCO 2014 | ViLT-B/32 | Image-to-text R@5 | 82.6 | # 13
Zero-Shot Cross-Modal Retrieval | COCO 2014 | ViLT-B/32 | Image-to-text R@10 | 89.6 | # 12
Zero-Shot Cross-Modal Retrieval | COCO 2014 | ViLT-B/32 | Text-to-image R@1 | 40.4 | # 13
Zero-Shot Cross-Modal Retrieval | COCO 2014 | ViLT-B/32 | Text-to-image R@5 | 70 | # 12
Zero-Shot Cross-Modal Retrieval | COCO 2014 | ViLT-B/32 | Text-to-image R@10 | 81.1 | # 10
Cross-Modal Retrieval | COCO 2014 | ViLT-B/32 | Image-to-text R@1 | 61.5 | # 20
Cross-Modal Retrieval | COCO 2014 | ViLT-B/32 | Image-to-text R@5 | 86.3 | # 20
Cross-Modal Retrieval | COCO 2014 | ViLT-B/32 | Image-to-text R@10 | 92.7 | # 19
Cross-Modal Retrieval | COCO 2014 | ViLT-B/32 | Text-to-image R@1 | 42.7 | # 25
Cross-Modal Retrieval | COCO 2014 | ViLT-B/32 | Text-to-image R@5 | 72.9 | # 22
Cross-Modal Retrieval | COCO 2014 | ViLT-B/32 | Text-to-image R@10 | 83.1 | # 21
Zero-Shot Cross-Modal Retrieval | Flickr30k | ViLT-B/32 | Image-to-text R@1 | 73.2 | # 16
Zero-Shot Cross-Modal Retrieval | Flickr30k | ViLT-B/32 | Image-to-text R@5 | 93.6 | # 17
Zero-Shot Cross-Modal Retrieval | Flickr30k | ViLT-B/32 | Image-to-text R@10 | 96.5 | # 15
Zero-Shot Cross-Modal Retrieval | Flickr30k | ViLT-B/32 | Text-to-image R@1 | 55 | # 17
Zero-Shot Cross-Modal Retrieval | Flickr30k | ViLT-B/32 | Text-to-image R@5 | 82.5 | # 17
Zero-Shot Cross-Modal Retrieval | Flickr30k | ViLT-B/32 | Text-to-image R@10 | 89.8 | # 15
Cross-Modal Retrieval | Flickr30k | ViLT-B/32 | Image-to-text R@1 | 83.5 | # 12
Cross-Modal Retrieval | Flickr30k | ViLT-B/32 | Image-to-text R@5 | 96.7 | # 12
Cross-Modal Retrieval | Flickr30k | ViLT-B/32 | Image-to-text R@10 | 98.6 | # 12
Cross-Modal Retrieval | Flickr30k | ViLT-B/32 | Text-to-image R@1 | 64.4 | # 13
Cross-Modal Retrieval | Flickr30k | ViLT-B/32 | Text-to-image R@5 | 88.7 | # 13
Cross-Modal Retrieval | Flickr30k | ViLT-B/32 | Text-to-image R@10 | 93.8 | # 13
Multimodal Intent Recognition | MMDialog | ViLT | F1 | 55.8 | # 4
Visual Reasoning | NLVR2 Dev | ViLT-B/32 | Accuracy | 75.7 | # 12
Visual Reasoning | NLVR2 Test | ViLT-B/32 | Accuracy | 76.13 | # 13
Image Retrieval | PhotoChat | ViLT | R@1 | 11.5 | # 2
Image Retrieval | PhotoChat | ViLT | R@5 | 33.8 | # 2
Image Retrieval | PhotoChat | ViLT | R@10 | 25.6 | # 5
Image Retrieval | PhotoChat | ViLT | Sum(R@1,5,10) | 71.0 | # 5
Multimodal Intent Recognition | PhotoChat | ViLT | F1 | 52.4 | # 5
Multimodal Intent Recognition | PhotoChat | ViLT | Precision | 55.4 | # 4
Multimodal Intent Recognition | PhotoChat | ViLT | Recall | 58.9 | # 4
Visual Question Answering (VQA) | VQA v2 test-dev | ViLT-B/32 | Accuracy | 71.26 | # 24
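Most retrieval rows above report Recall@K: the fraction of queries for which the correct match appears among the top K retrieved items. As a generic illustration (not the official evaluation script of any of these benchmarks), the helper below computes R@K from a query-by-candidate similarity matrix, assuming the ground-truth pairs lie on the diagonal.

import numpy as np

def recall_at_k(sim, k):
    """Recall@K from a similarity matrix sim[i, j] (query i vs. candidate j).

    Assumes one correct candidate per query, located on the diagonal.
    Illustrative only; real benchmarks handle multiple captions per image.
    """
    ranks = (-sim).argsort(axis=1)            # candidates sorted by descending score
    gt = np.arange(sim.shape[0])[:, None]     # index of the matching candidate
    hits = (ranks[:, :k] == gt).any(axis=1)   # is the ground truth within the top K?
    return 100.0 * hits.mean()

# Example: 5 queries x 5 candidates with correct pairs on the diagonal.
sim = np.eye(5) + 0.1 * np.random.rand(5, 5)
print(recall_at_k(sim, 1), recall_at_k(sim, 5))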
