TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Chart Question Answering	ChartQA	ScreenAI 5B (4.62 B params, w/ OCR)	1:1 Accuracy	76.7	# 5
Visual Question Answering (VQA)	DocVQA test	ScreenAI 5B (4.62 B params, w/ OCR)	ANLS	0.8988	# 5
Visual Question Answering (VQA)	InfographicVQA	ScreenAI 5B (4.62 B params, w/ OCR)	ANLS	65.90	# 3
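
For readers unfamiliar with the ANLS metric in the table above, here is a minimal sketch of Average Normalized Levenshtein Similarity as it is commonly defined for DocVQA-style benchmarks. The function names, the lowercase/strip normalization, and the 0.5 threshold default are assumptions for illustration; this is not the official evaluation script. ANLS lies in [0, 1] and is sometimes reported as a percentage, which may explain the different scales of the two ANLS rows.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best per-reference similarity.

    predictions: list of predicted answer strings.
    references:  list of lists of accepted ground-truth answers.
    A similarity counts only if the normalized distance is below
    `threshold`; otherwise the answer scores 0 for that reference.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

For example, `anls(["kitten"], [["sitting"]])` has edit distance 3 over a maximum length of 7, so the normalized distance is below the threshold and the score is 4/7.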

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/screenai-a-vision-language-model-for-ui-and/visual-question-answering-vqa-on)](https://paperswithcode.com/sota/visual-question-answering-vqa-on?p=screenai-a-vision-language-model-for-ui-and)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/screenai-a-vision-language-model-for-ui-and/chart-question-answering-on-chartqa)](https://paperswithcode.com/sota/chart-question-answering-on-chartqa?p=screenai-a-vision-language-model-for-ui-and)`
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/screenai-a-vision-language-model-for-ui-and/visual-question-answering-on-docvqa-test)](https://paperswithcode.com/sota/visual-question-answering-on-docvqa-test?p=screenai-a-vision-language-model-for-ui-and)`

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

7 Feb 2024 · Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
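
The flexible patching idea mentioned in the abstract can be sketched roughly as follows: rather than resizing every image to a fixed square grid, the patch grid is chosen to follow the image's aspect ratio while staying within a patch budget. This is a minimal sketch in the spirit of the pix2struct scheme, not ScreenAI's actual code; the function name and the default `patch_size` and `max_patches` values are hypothetical.

```python
import math


def flexible_patch_grid(height, width, patch_size=16, max_patches=1024):
    """Pick (rows, cols) for an aspect-ratio-preserving patch grid.

    The image would be rescaled to (rows * patch_size, cols * patch_size)
    so that rows * cols stays within max_patches. Default values are
    illustrative assumptions, not settings taken from the paper.
    """
    # Uniform scale factor under which the patch count hits the budget.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols
```

With these assumed defaults, a square 512x512 screenshot maps to a 32x32 grid, while a tall 1000x500 screenshot gets more rows than columns, so portrait and landscape screens are not distorted into the same shape.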

Code

google-research-datasets/screen_qa official

Tasks

Chart Question Answering

Language Modelling

Question Answering

Visual Question Answering (VQA)

Datasets

Introduced in the Paper:

ScreenQA Short

Used in the Paper:

DocVQA

ChartQA

InfographicVQA

Screen2Words

MP-DocVQA

OCR-VQA

Results from the Paper

Ranked #3 on Visual Question Answering (VQA) on InfographicVQA (using extra training data)
