TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Natural Language Inference	ANLI test	BLOOM 176B (one-shot)	A1	33.6	# 16
Natural Language Inference	ANLI test	BLOOM 176B (one-shot)	A2	33.8	# 24
Natural Language Inference	ANLI test	BLOOM 176B (one-shot)	A3	35.17	# 23
Natural Language Inference	ANLI test	Bloomberg GPT (one-shot)	A1	32.9	# 18
Natural Language Inference	ANLI test	Bloomberg GPT (one-shot)	A2	34.4	# 21
Natural Language Inference	ANLI test	Bloomberg GPT (one-shot)	A3	37.33	# 21
Natural Language Inference	ANLI test	OPT 66B (one-shot)	A1	33.1	# 17
Natural Language Inference	ANLI test	OPT 66B (one-shot)	A2	34.2	# 22
Natural Language Inference	ANLI test	OPT 66B (one-shot)	A3	34.92	# 24
Natural Language Inference	ANLI test	GPT-NeoX (one-shot)	A1	32.6	# 19
Natural Language Inference	ANLI test	GPT-NeoX (one-shot)	A2	33.8	# 24
Natural Language Inference	ANLI test	GPT-NeoX (one-shot)	A3	36.17	# 22
Common Sense Reasoning	ARC (Challenge)	GPT-NeoX 20B (1-shot)	Accuracy	45.39	# 35
Common Sense Reasoning	ARC (Challenge)	OPT 66B (one-shot)	Accuracy	44.54	# 37
Common Sense Reasoning	ARC (Challenge)	Bloomberg GPT 50B (1-shot)	Accuracy	48.63	# 32
Common Sense Reasoning	ARC (Challenge)	BLOOM 176B (1-shot)	Accuracy	50.85	# 29
Common Sense Reasoning	ARC (Easy)	GPT-NeoX 20B (1-shot)	Accuracy	70.79	# 28
Common Sense Reasoning	ARC (Easy)	BLOOM 176B (1-shot)	Accuracy	75.93	# 18
Common Sense Reasoning	ARC (Easy)	Bloomberg GPT 50B (1-shot)	Accuracy	73.99	# 22
Common Sense Reasoning	ARC (Easy)	OPT 66B (1-shot)	Accuracy	71.25	# 25
Common Sense Reasoning	BIG-bench (Causal Judgment)	PaLM 540B (few-shot, k=3)	Accuracy	61.0	# 2
Common Sense Reasoning	BIG-bench (Causal Judgment)	GPT-NeoX 20B (few-shot, k=3)	Accuracy	52.41	# 5
Common Sense Reasoning	BIG-bench (Causal Judgment)	OPT 66B (few-shot, k=3)	Accuracy	51.87	# 6
Common Sense Reasoning	BIG-bench (Causal Judgment)	BloombergGPT 50B (few-shot, k=3)	Accuracy	49.73	# 9
Common Sense Reasoning	BIG-bench (Causal Judgment)	BLOOM 176B (few-shot, k=3)	Accuracy	51.87	# 6
Common Sense Reasoning	BIG-bench (Date Understanding)	PaLM 540B (few-shot, k=3)	Accuracy	53.6	# 4
Common Sense Reasoning	BIG-bench (Date Understanding)	OPT 66B (few-shot, k=3)	Accuracy	49.60	# 7
Common Sense Reasoning	BIG-bench (Date Understanding)	Bloomberg GPT 50B (few-shot, k=3)	Accuracy	54.8	# 3
Common Sense Reasoning	BIG-bench (Date Understanding)	GPT-NeoX 20B (few-shot, k=3)	Accuracy	45.60	# 8
Common Sense Reasoning	BIG-bench (Date Understanding)	BLOOM 176B (few-shot, k=3)	Accuracy	50.00	# 6
Common Sense Reasoning	BIG-bench (Disambiguation QA)	BLOOM 176B (few-shot, k=3)	Accuracy	40.4	# 7
Common Sense Reasoning	BIG-bench (Disambiguation QA)	OPT 66B (few-shot, k=3)	Accuracy	40.4	# 7
Common Sense Reasoning	BIG-bench (Disambiguation QA)	PaLM 540B (few-shot, k=3)	Accuracy	60.8	# 3
Common Sense Reasoning	BIG-bench (Disambiguation QA)	Bloomberg GPT 50B (few-shot, k=3)	Accuracy	34	# 9
Common Sense Reasoning	BIG-bench (Disambiguation QA)	GPT-NeoX 20B (few-shot, k=3)	Accuracy	40.8	# 6
Logical Reasoning	BIG-bench (Formal Fallacies Syllogisms Negation)	PaLM 540B (few-shot, k=3)	Accuracy	53.6	# 4
Logical Reasoning	BIG-bench (Formal Fallacies Syllogisms Negation)	BLOOM 176B (few-shot, k=3)	Accuracy	52.8	# 5
Logical Reasoning	BIG-bench (Formal Fallacies Syllogisms Negation)	Bloomberg GPT 50B (few-shot, k=3)	Accuracy	50.8	# 8
Logical Reasoning	BIG-bench (Formal Fallacies Syllogisms Negation)	GPT-NeoX 20B (few-shot, k=3)	Accuracy	52.8	# 5
Logical Reasoning	BIG-bench (Formal Fallacies Syllogisms Negation)	OPT 66B (few-shot, k=3)	Accuracy	54	# 3
Multiple Choice Question Answering (MCQA)	BIG-bench (Hyperbaton)	PaLM 540B (few-shot, k=3)	Accuracy	70.8	# 7
Multiple Choice Question Answering (MCQA)	BIG-bench (Hyperbaton)	GPT-NeoX (few-shot, k=3)	Accuracy	92	# 1
Multiple Choice Question Answering (MCQA)	BIG-bench (Hyperbaton)	Bloomberg GPT (few-shot, k=3)	Accuracy	92	# 1
Multiple Choice Question Answering (MCQA)	BIG-bench (Hyperbaton)	OPT 66B (few-shot, k=3)	Accuracy	91.6	# 4
Multiple Choice Question Answering (MCQA)	BIG-bench (Hyperbaton)	BLOOM 176B (few-shot, k=3)	Accuracy	92	# 1
Multiple Choice Question Answering (MCQA)	BIG-bench (Movie Recommendation)	OPT 66B (few-shot, k=3)	Accuracy	91.2	# 3
Multiple Choice Question Answering (MCQA)	BIG-bench (Movie Recommendation)	PaLM 540B (few-shot, k=3)	Accuracy	87.2	# 6
Multiple Choice Question Answering (MCQA)	BIG-bench (Movie Recommendation)	Bloomberg GPT (few-shot, k=3)	Accuracy	90.4	# 5
Multiple Choice Question Answering (MCQA)	BIG-bench (Movie Recommendation)	BLOOM 176B (few-shot, k=3)	Accuracy	91.2	# 3
Multiple Choice Question Answering (MCQA)	BIG-bench (Movie Recommendation)	GPT-NeoX (few-shot, k=3)	Accuracy	86.4	# 7
Multiple Choice Question Answering (MCQA)	BIG-bench (Navigate)	OPT 66B (few-shot, k=3)	Accuracy	42	# 8
Multiple Choice Question Answering (MCQA)	BIG-bench (Navigate)	GPT-NeoX (few-shot, k=3)	Accuracy	45.2	# 7
Multiple Choice Question Answering (MCQA)	BIG-bench (Navigate)	Bloomberg GPT (few-shot, k=3)	Accuracy	42	# 8
Multiple Choice Question Answering (MCQA)	BIG-bench (Navigate)	BLOOM 176B (few-shot, k=3)	Accuracy	50	# 6
Multiple Choice Question Answering (MCQA)	BIG-bench (Navigate)	PaLM 540B (few-shot, k=3)	Accuracy	62.4	# 3
Logical Reasoning	BIG-bench (Penguins In A Table)	PaLM 540B (few-shot, k=3)	Accuracy	44.5	# 4
Logical Reasoning	BIG-bench (Penguins In A Table)	Bloomberg GPT (few-shot, k=3)	Accuracy	37.67	# 7
Logical Reasoning	BIG-bench (Penguins In A Table)	GPT-NeoX (few-shot, k=3)	Accuracy	33.56	# 8
Logical Reasoning	BIG-bench (Penguins In A Table)	OPT 66B (few-shot, k=3)	Accuracy	28.08	# 9
Logical Reasoning	BIG-bench (Penguins In A Table)	BLOOM 176B (few-shot, k=3)	Accuracy	40.41	# 6
Logical Reasoning	BIG-bench (Reasoning About Colored Objects)	Bloomberg GPT (few-shot, k=3)	Accuracy	34.8	# 7
Logical Reasoning	BIG-bench (Reasoning About Colored Objects)	GPT-NeoX (few-shot, k=3)	Accuracy	26	# 9
Logical Reasoning	BIG-bench (Reasoning About Colored Objects)	OPT 66B (few-shot, k=3)	Accuracy	31.2	# 8
Logical Reasoning	BIG-bench (Reasoning About Colored Objects)	BLOOM 176B (few-shot, k=3)	Accuracy	36.8	# 6
Logical Reasoning	BIG-bench (Reasoning About Colored Objects)	PaLM 540B (few-shot, k=3)	Accuracy	38	# 5
Multiple Choice Question Answering (MCQA)	BIG-bench (Ruin Names)	Bloomberg GPT (few-shot, k=3)	Accuracy	56	# 4
Multiple Choice Question Answering (MCQA)	BIG-bench (Ruin Names)	GPT-NeoX (few-shot, k=3)	Accuracy	54	# 6
Multiple Choice Question Answering (MCQA)	BIG-bench (Ruin Names)	OPT 66B (few-shot, k=3)	Accuracy	52.8	# 7
Multiple Choice Question Answering (MCQA)	BIG-bench (Ruin Names)	BLOOM 176B (few-shot, k=3)	Accuracy	54.8	# 5
Multiple Choice Question Answering (MCQA)	BIG-bench (Ruin Names)	PaLM 540B (few-shot, k=3)	Accuracy	76	# 3
Sarcasm Detection	BIG-bench (SNARKS)	BLOOM 176B (few-shot, k=3)	Accuracy	72.47	# 4
Sarcasm Detection	BIG-bench (SNARKS)	PaLM 540B (few-shot, k=3)	Accuracy	78.1	# 3
Sarcasm Detection	BIG-bench (SNARKS)	GPT-NeoX (few-shot, k=3)	Accuracy	62.36	# 6
Sarcasm Detection	BIG-bench (SNARKS)	Bloomberg GPT (few-shot, k=3)	Accuracy	69.66	# 5
Common Sense Reasoning	BIG-bench (Sports Understanding)	Bloomberg GPT (few-shot, k=3)	Accuracy	62.8	# 5
Common Sense Reasoning	BIG-bench (Sports Understanding)	PaLM 540B (few-shot, k=3)	Accuracy	80.4	# 3
Common Sense Reasoning	BIG-bench (Sports Understanding)	OPT 66B (few-shot, k=3)	Accuracy	54.4	# 7
Common Sense Reasoning	BIG-bench (Sports Understanding)	GPT-NeoX (few-shot, k=3)	Accuracy	53.2	# 8
Logical Reasoning	BIG-bench (Temporal Sequences)	BLOOM 176B (few-shot, k=3)	Accuracy	36.8	# 4
Logical Reasoning	BIG-bench (Temporal Sequences)	PaLM 540B (few-shot, k=3)	Accuracy	39.6	# 3
Logical Reasoning	BIG-bench (Temporal Sequences)	Bloomberg GPT (few-shot, k=3)	Accuracy	29.2	# 6
Logical Reasoning	BIG-bench (Temporal Sequences)	GPT-NeoX (few-shot, k=3)	Accuracy	21.2	# 8
Logical Reasoning	BIG-bench (Temporal Sequences)	OPT 66B (few-shot, k=3)	Accuracy	23.6	# 7
Question Answering	BoolQ	OPT 66B (1-shot)	Accuracy	57.5	# 54
Question Answering	BoolQ	BLOOM 176B (1-shot)	Accuracy	52.9	# 57
Question Answering	BoolQ	GPT-NeoX 20B (1-shot)	Accuracy	46.4	# 59
Question Answering	BoolQ	Bloomberg GPT 50B (1-shot)	Accuracy	74.6	# 34
Natural Language Inference	CommitmentBank	GPT-NeoX (one-shot)	Accuracy	48.21	# 17
Natural Language Inference	CommitmentBank	Bloomberg GPT (one-shot)	Accuracy	53.57	# 16
Natural Language Inference	CommitmentBank	OPT 66B (one-shot)	Accuracy	44.64	# 19
Natural Language Inference	CommitmentBank	BLOOM 176B (one-shot)	Accuracy	48.21	# 17
Common Sense Reasoning	CommonsenseQA	OPT 66B (1-shot)	Accuracy	66.4	# 20
Common Sense Reasoning	CommonsenseQA	Bloomberg GPT 50B (1-shot)	Accuracy	65.5	# 21
Common Sense Reasoning	CommonsenseQA	GPT-NeoX 20B (1-shot)	Accuracy	60.4	# 27
Common Sense Reasoning	CommonsenseQA	BLOOM 176B (1-shot)	Accuracy	64.2	# 23
Question Answering	COPA	OPT 66B (one-shot)	Accuracy	86	# 25
Question Answering	COPA	BLOOM 176B (one-shot)	Accuracy	84	# 31
Question Answering	COPA	GPT-NeoX (one-shot)	Accuracy	88	# 21
Question Answering	COPA	Bloomberg GPT (one-shot)	Accuracy	86	# 25
Sentence Completion	HellaSwag	OPT 66B (1-shot)	Accuracy	73.5	# 49
Sentence Completion	HellaSwag	BLOOM 176B (1-shot)	Accuracy	73.2	# 50
Sentence Completion	HellaSwag	BloombergGPT 50B (1-shot)	Accuracy	73.9	# 48
Sentence Completion	HellaSwag	GPT-NeoX 20B (1-shot)	Accuracy	68.4	# 52
Multi-task Language Understanding	MMLU	BLOOM 176B (5-shot)	Average (%)	39.1	# 81
Multi-task Language Understanding	MMLU	Bloomberg GPT 50B (5-shot)	Average (%)	39.2	# 79
Multi-task Language Understanding	MMLU	OPT 66B (5-shot)	Average (%)	36	# 84
Question Answering	MultiRC	BLOOM 176B (1-shot)	F1	26.7	# 23
Question Answering	MultiRC	Bloomberg GPT 50B (1-shot)	F1	62.3	# 18
Question Answering	MultiRC	GPT-NeoX 20B (1-shot)	F1	22.9	# 24
Question Answering	MultiRC	OPT 66B (1-shot)	F1	18.8	# 25
Question Answering	OpenBookQA	Bloomberg GPT 50B (1-shot)	Accuracy	51.6	# 32
Question Answering	OpenBookQA	GPT-NeoX 20B (2-shot)	Accuracy	44.2	# 34
Question Answering	OpenBookQA	OPT 66B (one-shot)	Accuracy	58.0	# 27
Question Answering	OpenBookQA	BLOOM 176B (2-shot)	Accuracy	47.2	# 33
Question Answering	PIQA	OPT 66B (1-shot)	Accuracy	77.6	# 35
Question Answering	PIQA	GPT-NeoX 20B (1-shot)	Accuracy	75.8	# 42
Question Answering	PIQA	Bloomberg GPT 50B (1-shot)	Accuracy	77.9	# 34
Question Answering	PIQA	BLOOM 176B (1-shot)	Accuracy	77	# 37
Reading Comprehension	RACE	OPT 66B (one-shot)	Accuracy (High)	37.02	# 17
Reading Comprehension	RACE	OPT 66B (one-shot)	Accuracy (Middle)	47.42	# 17
Reading Comprehension	RACE	GPT-NeoX (one-shot)	Accuracy (High)	34.33	# 18
Reading Comprehension	RACE	GPT-NeoX (one-shot)	Accuracy (Middle)	41.23	# 18
Reading Comprehension	RACE	Bloomberg GPT (one-shot)	Accuracy (High)	41.74	# 15
Reading Comprehension	RACE	Bloomberg GPT (one-shot)	Accuracy (Middle)	54.32	# 15
Reading Comprehension	RACE	BLOOM 176B (one-shot)	Accuracy (High)	39.14	# 16
Reading Comprehension	RACE	BLOOM 176B (one-shot)	Accuracy (Middle)	52.3	# 16
Common Sense Reasoning	ReCoRD	Bloomberg GPT 50B (1-shot)	F1	82.8	# 17
Common Sense Reasoning	ReCoRD	BLOOM 176B (1-shot)	F1	78	# 23
Common Sense Reasoning	ReCoRD	OPT 66B (1-shot)	F1	82.5	# 21
Common Sense Reasoning	ReCoRD	GPT-NeoX 20B (1-shot)	F1	67.9	# 28
Natural Language Inference	RTE	GPT-NeoX 20B (1-shot)	Accuracy	53.8	# 86
Natural Language Inference	RTE	OPT 66B (1-shot)	Accuracy	54.9	# 83
Natural Language Inference	RTE	BLOOM 176B (1-shot)	Accuracy	57.4	# 80
Natural Language Inference	RTE	Bloomberg GPT 50B (1-shot)	Accuracy	69.3	# 56
Common Sense Reasoning	WinoGrande	Bloomberg GPT (one-shot)	Accuracy	64.1	# 42
Common Sense Reasoning	WinoGrande	BLOOM 176B (1-shot)	Accuracy	67	# 38
Common Sense Reasoning	WinoGrande	OPT 66B (1-shot)	Accuracy	66.1	# 40
Common Sense Reasoning	WinoGrande	GPT-NeoX (one-shot)	Accuracy	60.6	# 46
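
The rows above are tab-separated, so they can be read with a standard CSV parser. The following is a minimal sketch, assuming the column names from the header row; the helper name `best_per_dataset` and the `SAMPLE` constant are hypothetical and only hold a few rows copied from the table (BoolQ and MultiRC) for illustration.

```python
import csv
import io

# A few rows copied verbatim from the table above (tab-separated).
# SAMPLE is an illustrative subset, not the full leaderboard.
SAMPLE = (
    "TASK\tDATASET\tMODEL\tMETRIC NAME\tMETRIC VALUE\tGLOBAL RANK\n"
    "Question Answering\tBoolQ\tOPT 66B (1-shot)\tAccuracy\t57.5\t# 54\n"
    "Question Answering\tBoolQ\tBLOOM 176B (1-shot)\tAccuracy\t52.9\t# 57\n"
    "Question Answering\tBoolQ\tGPT-NeoX 20B (1-shot)\tAccuracy\t46.4\t# 59\n"
    "Question Answering\tBoolQ\tBloomberg GPT 50B (1-shot)\tAccuracy\t74.6\t# 34\n"
    "Question Answering\tMultiRC\tBLOOM 176B (1-shot)\tF1\t26.7\t# 23\n"
    "Question Answering\tMultiRC\tBloomberg GPT 50B (1-shot)\tF1\t62.3\t# 18\n"
    "Question Answering\tMultiRC\tGPT-NeoX 20B (1-shot)\tF1\t22.9\t# 24\n"
    "Question Answering\tMultiRC\tOPT 66B (1-shot)\tF1\t18.8\t# 25\n"
)


def best_per_dataset(tsv_text):
    """Return {(dataset, metric): (model, value)} with the highest value per key."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    best = {}
    for row in reader:
        key = (row["DATASET"], row["METRIC NAME"])
        # Tolerate a trailing percent sign on the metric value.
        value = float(row["METRIC VALUE"].rstrip("%"))
        if key not in best or value > best[key][1]:
            best[key] = (row["MODEL"], value)
    return best


if __name__ == "__main__":
    for (dataset, metric), (model, value) in best_per_dataset(SAMPLE).items():
        print(f"{dataset} [{metric}]: {model} = {value}")
```

Higher is treated as better here, which holds for the accuracy and F1 metrics in this table; the global rank column is left unparsed because it depends on entries outside this page.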

BloombergGPT: A Large Language Model for Finance

30 Mar 2023 · Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, Gideon Mann

The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in the literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our modeling choices, training process, and evaluation methodology. We release Training Chronicles (Appendix C) detailing our experience in training BloombergGPT.
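
The two token counts quoted in the abstract imply a roughly even split between financial and general-purpose data. The sketch below only does that arithmetic; the function name `mix_shares` and the constants are hypothetical, and the result describes the nominal corpus composition, not how many tokens were actually consumed during training.

```python
# Token counts quoted in the abstract, in billions.
FINANCIAL_TOKENS_B = 363
GENERAL_TOKENS_B = 345


def mix_shares(financial_b, general_b):
    """Return (total, financial share, general share) for a two-part corpus."""
    total = financial_b + general_b
    return total, financial_b / total, general_b / total


total_b, fin_share, gen_share = mix_shares(FINANCIAL_TOKENS_B, GENERAL_TOKENS_B)
print(f"total: {total_b}B tokens")            # total: 708B tokens
print(f"financial share: {fin_share:.1%}")    # financial share: 51.3%
print(f"general share: {gen_share:.1%}")      # general share: 48.7%
```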

Tasks

Causal Judgment

Common Sense Reasoning

Date Understanding

Disambiguation QA

Formal Fallacies Syllogisms Negation

Hyperbaton

Language Modelling

Large Language Model

Logical Reasoning

Movie Recommendation

Multiple Choice Question Answering (MCQA)

Multi-task Language Understanding

Named Entity Recognition

Natural Language Inference

Navigate

Penguins In A Table

Question Answering

Reading Comprehension

Reasoning About Colored Objects

Ruin Names

Sarcasm Detection

Sentence Completion

Sentiment Analysis

SNARKS

Sports Understanding

Temporal Sequences

Datasets

GLUE

MMLU

HellaSwag

BoolQ

PIQA

RACE

OpenBookQA

WinoGrande

CommonsenseQA

WebText

The Pile

COPA

ANLI

BIG-bench

WiC

BBH

MultiRC

HELM

ReCoRD

ARC (AI2 Reasoning Challenge)

FinQA

RTE

FIN

ConvFinQA

CommitmentBank

Results from the Paper

Ranked #1 on Multiple Choice Question Answering (MCQA) on BIG-bench (Hyperbaton)

