Making Large Language Models Better Data Creators

31 Oct 2023 · Dong-Ho Lee, Jay Pujara, Mohit Sewak, Ryen W. White, Sujay Kumar Jauhar

Although large language models (LLMs) have advanced the state of the art in NLP significantly, deploying them for downstream applications is still challenging due to cost, responsiveness, control, or concerns around privacy and security. As such, trainable models are still the preferred option in some cases. However, these models still require human-labeled data for optimal performance, which is expensive and time-consuming to obtain. To address this issue, several techniques reduce human effort by labeling or generating data using LLMs. Although these methods are effective for certain applications, they encounter difficulties in real-world scenarios. Labeling data requires careful data selection, while generating data necessitates task-specific prompt engineering. In this paper, we propose a unified data creation pipeline that requires only a single formatting example, and which is applicable to a broad range of tasks, including traditionally problematic ones with semantically devoid label spaces. In our experiments we demonstrate that instruction-following LLMs are highly cost-effective data creators, and that models trained with these data exhibit performance better than those trained with human-labeled data (by up to 17.5%) on out-of-distribution evaluation, while maintaining comparable performance on in-distribution tasks. These results have important implications for the robustness of NLP systems deployed in the real world.
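To make the idea of creating data from "a single formatting example" more concrete, here is a minimal Python sketch. It is not the paper's pipeline or the code in the linked repository: the prompt wording, the one-JSON-object-per-line output convention, and the helper names `build_prompt` and `parse_examples` are all assumptions made for illustration. No LLM is called; the model's reply is passed in as a plain string.

```python
import json
from typing import Dict, List


def build_prompt(format_example: Dict[str, str], n: int) -> str:
    """Build a generation prompt from one formatting example.

    The template text is hypothetical, not taken from the paper.
    """
    example_json = json.dumps(format_example, sort_keys=True)
    return (
        "Here is one example in the required JSON format:\n"
        f"{example_json}\n"
        f"Generate {n} new, diverse examples, one JSON object per line, "
        "using exactly the same keys."
    )


def parse_examples(raw_output: str, format_example: Dict[str, str]) -> List[dict]:
    """Keep only lines that parse as JSON objects with the example's keys."""
    expected_keys = set(format_example)
    kept = []
    for line in raw_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if isinstance(record, dict) and set(record) == expected_keys:
            kept.append(record)
    return kept
```

In use, the string returned by `build_prompt(seed, n)` would be sent to an instruction-following model of your choice, and the model's reply would be filtered with `parse_examples(reply, seed)` before being used as training data. The key check is only a structural filter; it does not verify that the generated labels are correct.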

Code

microsoft/llm-data-creation official

Tasks

Instruction Following

Prompt Engineering

Visual Question Answering

Datasets

BoolQ

PIQA

WinoGrande

CommonsenseQA

StrategyQA

BioASQ

PubMedQA

CREAK

RiddleSense

ViP-Bench

Results from the Paper

Ranked #3 on Visual Question Answering on ViP-Bench

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Visual Question Answering	ViP-Bench	ViP-LLaVA-13B (Visual Prompt)	GPT-4 score (bbox)	48.3	# 3
Visual Question Answering	ViP-Bench	ViP-LLaVA-13B (Visual Prompt)	GPT-4 score (human)	48.2	# 3
