Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy

11 Feb 2024 · Simon Ging, María A. Bravo, Thomas Brox

The evaluation of text-generative vision-language models is a challenging yet crucial endeavor. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing innovative evaluation methodologies, our research seeks to advance our understanding of these models' capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets, which allows for a granular evaluation of text-generative vision-language models and their comparison with discriminative vision-language models. To improve the assessment of coarse answers on fine-grained classification tasks, we suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category. Finally, we compare traditional NLP and LLM-based metrics for the problem of evaluating model predictions against ground-truth answers, and we perform a human evaluation study on which we base our choice of final metric. We apply our benchmark to a suite of vision-language models and present a detailed comparison of their abilities on object, action, and attribute classification. Our contributions aim to lay the foundation for more precise and meaningful assessments, facilitating targeted progress in the exciting field of vision-language modeling.
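The results below report ExactMatch, Contains, and ClipMatch@k scores. The paper defines these metrics precisely; the following is only a rough sketch of how such metrics could be computed, assuming simple text normalization for the string-based metrics and an unspecified text encoder (e.g. a CLIP-style text tower) for the embedding-based one. It is not the authors' exact implementation.

```python
import re
import numpy as np


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (assumed normalization)."""
    text = re.sub(r"[^\w\s]", "", text.lower().strip())
    return re.sub(r"\s+", " ", text)


def exact_match(prediction: str, label: str) -> bool:
    """True if the normalized free-form answer equals the ground-truth class name."""
    return normalize(prediction) == normalize(label)


def contains(prediction: str, label: str) -> bool:
    """True if the ground-truth class name appears anywhere in the answer."""
    return normalize(label) in normalize(prediction)


def clipmatch_at_k(answer_emb: np.ndarray, class_embs: np.ndarray,
                   gt_index: int, k: int = 1) -> bool:
    """Map a free-form answer onto the label space via embedding similarity.

    answer_emb: (d,) embedding of the generated answer (encoder choice is an assumption).
    class_embs: (num_classes, d) embeddings of all class names.
    gt_index:   index of the ground-truth class.
    Returns True if the ground-truth class is among the k most similar class names.
    """
    a = answer_emb / np.linalg.norm(answer_emb)
    c = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    sims = c @ a                      # cosine similarity to every class name
    topk = np.argsort(-sims)[:k]      # indices of the k closest classes
    return bool(gt_index in topk)


# Example: a verbose generative answer fails ExactMatch but passes Contains.
print(exact_match("I think this is a golden retriever puppy.", "golden retriever"))  # False
print(contains("I think this is a golden retriever puppy.", "golden retriever"))     # True
```

The string metrics penalize verbose generative answers, while ClipMatch@k tolerates paraphrases by matching in embedding space; comparing the two is part of what the benchmark measures.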

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|------|---------|-------|-------------|--------------|-------------|
| Visual Question Answering (VQA) | ActivityNet | BLIP-2 T5 | ClipMatch@1 | 53.39 | #1 |
| | | | ClipMatch@5 | 74.71 | #1 |
| | | | Contains | 15.70 | #1 |
| | | | ExactMatch | 7.07 | #1 |
| | | | Follow-up ClipMatch@1 | 62.02 | #1 |
| | | | Follow-up ClipMatch@5 | 75.13 | #1 |
| | | | Follow-up Contains | 18.09 | #1 |
| | | | Follow-up ExactMatch | 8.84 | #1 |
| Visual Question Answering (VQA) | COCO | InstructBLIP Vicuna | ClipMatch@1 | 59.58 | #1 |
| | | | ClipMatch@5 | 73.32 | #1 |
| | | | Contains | 27.52 | #1 |
| | | | ExactMatch | 26.50 | #1 |
| Visual Question Answering (VQA) | ImageNet | BLIP-2 OPT | ClipMatch@1 | 57.10 | #1 |
| | | | ClipMatch@5 | 77.24 | #1 |
| | | | Contains | 35.49 | #1 |
| | | | ExactMatch | 0.87 | #1 |
| | | | Follow-up ClipMatch@1 | 67.22 | #1 |
| | | | Follow-up ClipMatch@5 | 83.54 | #1 |
| | | | Follow-up Contains | 40.31 | #1 |
| | | | Follow-up ExactMatch | 2.54 | #1 |
| Visual Question Answering (VQA) | OVAD benchmark | BLIP | Contains w. Synonyms | 45.70 | #1 |
| | | | ExactMatch w. Synonyms | 36.99 | #1 |
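The "Follow-up" rows above refer to the second-round questions that the benchmark asks when a model's first answer names only a coarse category of the ground-truth label. As a minimal sketch of how such follow-up questions could be templated, assuming a toy parent-to-child mapping (the paper derives the hierarchy from the dataset's label space, e.g. a WordNet-style hierarchy for ImageNet classes; the question template and `HIERARCHY` dictionary here are illustrative only):

```python
from typing import Optional

# Hypothetical fragment of a label hierarchy: fine-grained class -> parent category.
HIERARCHY = {
    "golden retriever": "dog",
    "beagle": "dog",
    "tabby cat": "cat",
}


def follow_up_question(prediction: str, gt_label: str) -> Optional[str]:
    """If the answer only names the parent category of the ground-truth label,
    generate a templated follow-up question asking for the finer category."""
    parent = HIERARCHY.get(gt_label)
    pred = prediction.lower()
    if parent and parent in pred and gt_label not in pred:
        return f"What kind of {parent} is shown in the image?"
    return None  # answer already fine-grained, or not matching the parent: no follow-up


print(follow_up_question("It looks like a dog.", "golden retriever"))
# -> "What kind of dog is shown in the image?"
```

The follow-up answer is then scored with the same metrics, which is why the table reports a separate "Follow-up" variant for each of them.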
