VL-BERT is pre-trained on a large-scale image-caption dataset (Conceptual Captions) together with text-only corpora, and it can be fine-tuned for most visual-linguistic downstream tasks. Its inputs are both linguistic and visual elements: subwords from the input sentences and regions of interest (RoIs) from the input images. Its backbone is a multi-layer bidirectional Transformer encoder, modified to accommodate visual content by adding a new type of visual feature embedding to the input feature embeddings. Each input element is represented by four types of embeddings: token embedding, visual feature embedding, segment embedding, and sequence position embedding. Two pre-training tasks are used: masked language modeling with visual clues, and masked RoI classification with linguistic clues.
Source: VL-BERT: Pre-training of Generic Visual-Linguistic Representations

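The description says each input element carries four kinds of embeddings. The sketch below shows one way the corresponding four index streams could be laid out for a sentence plus a set of RoIs. It is an illustration under stated assumptions, not the authors' implementation: the special tokens `[CLS]`, `[SEP]`, `[IMG]`, `[END]`, the segment labels `A` and `C`, the use of a whole-image feature for non-RoI elements, and the choice to give all RoIs the same position index are assumptions not stated in the text above, and `Element` and `build_inputs` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    token: str     # key for the token embedding (subword or special token)
    visual: str    # key for the visual feature embedding
    segment: str   # key for the segment embedding
    position: int  # index for the sequence position embedding


def build_inputs(subwords: List[str], num_rois: int) -> List[Element]:
    """Lay out one sentence followed by `num_rois` image regions.

    Assumed layout: [CLS] subwords [SEP] [IMG]*num_rois [END].
    """
    elements: List[Element] = []

    # Linguistic part. Assumption: words are paired with a feature of the
    # whole image and carry segment label "A".
    for tok in ["[CLS]", *subwords, "[SEP]"]:
        elements.append(Element(tok, "whole_image", "A", len(elements)))

    # Visual part. Assumption: each RoI uses a placeholder [IMG] token and
    # segment label "C", and all RoIs share one position index because
    # image regions have no natural order.
    roi_position = len(elements)
    for k in range(num_rois):
        elements.append(Element("[IMG]", f"roi_{k}", "C", roi_position))

    # Assumed end-of-sequence marker.
    elements.append(Element("[END]", "whole_image", "C", roi_position + 1))
    return elements


if __name__ == "__main__":
    for element in build_inputs(["a", "dog", "run", "##s"], num_rois=2):
        print(element)
```

In a full model, each of the four fields would index into its own embedding table (or, for the visual field, a feature extractor), and the four resulting vectors would be combined into the element's input representation before the Transformer encoder.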
Task | Papers | Share |
---|---|---|
Decoder | 1 | 7.14% |
Image Captioning | 1 | 7.14% |
Machine Translation | 1 | 7.14% |
Multimodal Machine Translation | 1 | 7.14% |
Text Generation | 1 | 7.14% |
Translation | 1 | 7.14% |
Image-text matching | 1 | 7.14% |
Language Modelling | 1 | 7.14% |
Question Answering | 1 | 7.14% |