| TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK |
|---|---|---|---|---|---|
| Image Captioning | COCO Captions | BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLEU-4 | 43.7 | # 4 |
| Image Captioning | COCO Captions | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 145.8 | # 6 |
| Image Captioning | COCO Captions | BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLEU-4 | 43.5 | # 5 |
| Image Captioning | COCO Captions | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 145.2 | # 8 |
| Image Captioning | COCO Captions | BLIP-2 ViT-G FlanT5 XL (zero-shot) | BLEU-4 | 42.4 | # 8 |
| Image Captioning | COCO Captions | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 144.5 | # 9 |
| Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1 | 96.9 | # 5 |
| Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@5 | 100 | # 1 |
| Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@10 | 100 | # 1 |
| Image Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@5 | 98.1 | # 1 |
| Image Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@10 | 98.9 | # 1 |
| Image Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1 | 89.7 | # 1 |
| Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1 | 97.6 | # 2 |
| Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@5 | 100 | # 1 |
| Image-to-Text Retrieval | Flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@10 | 100 | # 1 |
| Image Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@5 | 97.6 | # 2 |
| Image Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@10 | 98.9 | # 1 |
| Image Retrieval | Flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1 | 88.6 | # 2 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy | 44.2 | # 9 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy | 44.7 | # 7 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy | 33.9 | # 13 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy | 34.6 | # 12 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy | 36.4 | # 11 |
| Visual Question Answering (VQA) | GQA test-dev | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy | 44.4 | # 8 |
| Visual Question Answering (VQA) | InfiMM-Eval | BLIP-2-OPT2.7B | Overall score | 19.31 | # 12 |
| Visual Question Answering (VQA) | InfiMM-Eval | BLIP-2-OPT2.7B | Deductive | 2.76 | # 14 |
| Visual Question Answering (VQA) | InfiMM-Eval | BLIP-2-OPT2.7B | Abductive | 18.96 | # 12 |
| Visual Question Answering (VQA) | InfiMM-Eval | BLIP-2-OPT2.7B | Analogical | 7.5 | # 12 |
| Visual Question Answering (VQA) | InfiMM-Eval | BLIP-2-OPT2.7B | Params | 3B | # 1 |
| Visual Question Answering (VQA) | InfoSeek | BLIP2 | Accuracy | 14.6 | # 6 |
| Visual Instruction Following | LLaVA-Bench | BLIP-2 | avg score | 38.1 | # 7 |
| Visual Question Answering | MM-Vet | BLIP-2-12B | GPT-4 score | 22.4±0.2 | # 90 |
| Visual Question Answering | MM-Vet | BLIP-2-12B | Params | 12B | # 1 |
| Image Retrieval | MS COCO | BLIP-2 ViT-G (fine-tuned) | Recall@10 | 92.6 | # 3 |
| Image Retrieval | MS COCO | BLIP-2 ViT-G (fine-tuned) | Recall@1 | 68.3 | # 1 |
| Image Retrieval | MS COCO | BLIP-2 ViT-G (fine-tuned) | Recall@5 | 87.7 | # 2 |
| Image-to-Text Retrieval | MS COCO | BLIP-2 ViT-L (fine-tuned) | Recall@10 | 98.0 | # 4 |
| Image-to-Text Retrieval | MS COCO | BLIP-2 ViT-L (fine-tuned) | Recall@1 | 83.5 | # 3 |
| Image-to-Text Retrieval | MS COCO | BLIP-2 ViT-L (fine-tuned) | Recall@5 | 96.0 | # 3 |
| Image-to-Text Retrieval | MS COCO | BLIP-2 ViT-G (fine-tuned) | Recall@10 | 98.5 | # 2 |
| Image-to-Text Retrieval | MS COCO | BLIP-2 ViT-G (fine-tuned) | Recall@1 | 85.4 | # 1 |
| Image-to-Text Retrieval | MS COCO | BLIP-2 ViT-G (fine-tuned) | Recall@5 | 97.0 | # 1 |
| Image Retrieval | MS COCO | BLIP-2 ViT-L (fine-tuned) | Recall@10 | 91.8 | # 4 |
| Image Retrieval | MS COCO | BLIP-2 ViT-L (fine-tuned) | Recall@1 | 66.3 | # 3 |
| Image Retrieval | MS COCO | BLIP-2 ViT-L (fine-tuned) | Recall@5 | 86.5 | # 3 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 123.7 | # 1 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | SPICE | 15.8 | # 2 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 123.7 | # 1 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | SPICE | 16.3 | # 1 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 123 | # 3 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | SPICE | 15.8 | # 2 |
| Image Captioning | nocaps-val-in-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 117.8 | # 3 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | SPICE | 15.4 | # 2 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 120.2 | # 1 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | SPICE | 15.9 | # 1 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 119.2 | # 2 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | SPICE | 15.3 | # 3 |
| Image Captioning | nocaps-val-near-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 124.4 | # 2 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | SPICE | 14.8 | # 3 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 123.4 | # 3 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | SPICE | 15.1 | # 1 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 124.8 | # 1 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | SPICE | 15.1 | # 1 |
| Image Captioning | nocaps-val-out-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr | 121.0 | # 2 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 6.7B (zero-shot) | SPICE | 15.3 | # 3 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr | 121.6 | # 1 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G FlanT5 XL (zero-shot) | SPICE | 15.8 | # 1 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr | 119.7 | # 3 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 2.7B (zero-shot) | SPICE | 15.4 | # 2 |
| Image Captioning | nocaps-val-overall | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Pre-train (#images) | 1.1B | # 1 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy | 30.2 | # 32 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy | 31.7 | # 31 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy | 36.4 | # 29 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy | 39.4 | # 28 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy | 40.7 | # 27 |
| Visual Question Answering (VQA) | OK-VQA | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy | 45.9 | # 22 |
| Open Vocabulary Attribute Detection | OVAD-Box benchmark | BLIP 2 (pretrained) | mean average precision | 25.5 | # 2 |
| Medical Visual Question Answering | PMC-VQA | BLIP-2 | Accuracy | 24.3 | # 4 |
| Generative Visual Question Answering | PMC-VQA | BLIP-2 | BLEU-1 | 7.6 | # 2 |
| Visual Question Answering (VQA) | PMC-VQA | BLIP-2 | Accuracy | 24.3 | # 4 |
| Visual Question Answering | VQA v2 test-dev | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | Accuracy | 81.74 | # 4 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy | 65 | # 43 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy | 49.7 | # 55 |
| Visual Question Answering | VQA v2 test-dev | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | Accuracy | 82.30 | # 1 |
| Visual Question Answering | VQA v2 test-dev | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | Accuracy | 81.66 | # 5 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy | 52.3 | # 52 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy | 52.6 | # 51 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy | 62.3 | # 49 |
| Visual Question Answering (VQA) | VQA v2 test-dev | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy | 63 | # 48 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy | 53.5 | # 6 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy | 65.2 | # 1 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy | 54.3 | # 5 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy | 63.1 | # 3 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy | 62.6 | # 4 |
| Visual Question Answering | VQA v2 val | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | Accuracy | 81.55 | # 3 |
| Visual Question Answering | VQA v2 val | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | Accuracy | 81.59 | # 2 |
| Visual Question Answering | VQA v2 val | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | Accuracy | 82.19 | # 1 |
| Visual Question Answering (VQA) | VQA v2 val | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy | 50.1 | # 7 |