| TASK | DATASET | MODEL | METRIC NAME | METRIC VALUE | GLOBAL RANK |
|---|---|---|---|---|---|
| Natural Language Inference | ANLI test | BLOOM 176B (one-shot) | A1 | 33.6 | # 16 |
| Natural Language Inference | ANLI test | BLOOM 176B (one-shot) | A2 | 33.8 | # 24 |
| Natural Language Inference | ANLI test | BLOOM 176B (one-shot) | A3 | 35.17 | # 23 |
| Natural Language Inference | ANLI test | Bloomberg GPT (one-shot) | A1 | 32.9 | # 18 |
| Natural Language Inference | ANLI test | Bloomberg GPT (one-shot) | A2 | 34.4 | # 21 |
| Natural Language Inference | ANLI test | Bloomberg GPT (one-shot) | A3 | 37.33 | # 21 |
| Natural Language Inference | ANLI test | OPT 66B (one-shot) | A1 | 33.1 | # 17 |
| Natural Language Inference | ANLI test | OPT 66B (one-shot) | A2 | 34.2 | # 22 |
| Natural Language Inference | ANLI test | OPT 66B (one-shot) | A3 | 34.92 | # 24 |
| Natural Language Inference | ANLI test | GPT-NeoX (one-shot) | A1 | 32.6 | # 19 |
| Natural Language Inference | ANLI test | GPT-NeoX (one-shot) | A2 | 33.8 | # 24 |
| Natural Language Inference | ANLI test | GPT-NeoX (one-shot) | A3 | 36.17 | # 22 |
| Common Sense Reasoning | ARC (Challenge) | GPT-NeoX 20B (1-shot) | Accuracy | 45.39 | # 35 |
| Common Sense Reasoning | ARC (Challenge) | OPT 66B (one-shot) | Accuracy | 44.54 | # 37 |
| Common Sense Reasoning | ARC (Challenge) | Bloomberg GPT 50B (1-shot) | Accuracy | 48.63 | # 32 |
| Common Sense Reasoning | ARC (Challenge) | BLOOM 176B (1-shot) | Accuracy | 50.85 | # 29 |
| Common Sense Reasoning | ARC (Easy) | GPT-NeoX 20B (1-shot) | Accuracy | 70.79 | # 28 |
| Common Sense Reasoning | ARC (Easy) | BLOOM 176B (1-shot) | Accuracy | 75.93 | # 18 |
| Common Sense Reasoning | ARC (Easy) | Bloomberg GPT 50B (1-shot) | Accuracy | 73.99 | # 22 |
| Common Sense Reasoning | ARC (Easy) | OPT 66B (1-shot) | Accuracy | 71.25 | # 25 |
| Common Sense Reasoning | BIG-bench (Causal Judgment) | PaLM 540B (few-shot, k=3) | Accuracy | 61.0 | # 2 |
| Common Sense Reasoning | BIG-bench (Causal Judgment) | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.41 | # 5 |
| Common Sense Reasoning | BIG-bench (Causal Judgment) | OPT 66B (few-shot, k=3) | Accuracy | 51.87 | # 6 |
| Common Sense Reasoning | BIG-bench (Causal Judgment) | BloombergGPT 50B (few-shot, k=3) | Accuracy | 49.73 | # 9 |
| Common Sense Reasoning | BIG-bench (Causal Judgment) | BLOOM 176B (few-shot, k=3) | Accuracy | 51.87 | # 6 |
| Common Sense Reasoning | BIG-bench (Date Understanding) | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 | # 4 |
| Common Sense Reasoning | BIG-bench (Date Understanding) | OPT 66B (few-shot, k=3) | Accuracy | 49.60 | # 7 |
| Common Sense Reasoning | BIG-bench (Date Understanding) | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 54.8 | # 3 |
| Common Sense Reasoning | BIG-bench (Date Understanding) | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 45.60 | # 8 |
| Common Sense Reasoning | BIG-bench (Date Understanding) | BLOOM 176B (few-shot, k=3) | Accuracy | 50.00 | # 6 |
| Common Sense Reasoning | BIG-bench (Disambiguation QA) | BLOOM 176B (few-shot, k=3) | Accuracy | 40.4 | # 7 |
| Common Sense Reasoning | BIG-bench (Disambiguation QA) | OPT 66B (few-shot, k=3) | Accuracy | 40.4 | # 7 |
| Common Sense Reasoning | BIG-bench (Disambiguation QA) | PaLM 540B (few-shot, k=3) | Accuracy | 60.8 | # 3 |
| Common Sense Reasoning | BIG-bench (Disambiguation QA) | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 34 | # 9 |
| Common Sense Reasoning | BIG-bench (Disambiguation QA) | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 40.8 | # 6 |
| Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | PaLM 540B (few-shot, k=3) | Accuracy | 53.6 | # 4 |
| Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | BLOOM 176B (few-shot, k=3) | Accuracy | 52.8 | # 5 |
| Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | Bloomberg GPT 50B (few-shot, k=3) | Accuracy | 50.8 | # 8 |
| Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | GPT-NeoX 20B (few-shot, k=3) | Accuracy | 52.8 | # 5 |
| Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | OPT 66B (few-shot, k=3) | Accuracy | 54 | # 3 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Hyperbaton) | PaLM 540B (few-shot, k=3) | Accuracy | 70.8 | # 7 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Hyperbaton) | GPT-NeoX (few-shot, k=3) | Accuracy | 92 | # 1 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Hyperbaton) | Bloomberg GPT (few-shot, k=3) | Accuracy | 92 | # 1 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Hyperbaton) | OPT 66B (few-shot, k=3) | Accuracy | 91.6 | # 4 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Hyperbaton) | BLOOM 176B (few-shot, k=3) | Accuracy | 92 | # 1 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Movie Recommendation) | OPT 66B (few-shot, k=3) | Accuracy | 91.2 | # 3 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Movie Recommendation) | PaLM 540B (few-shot, k=3) | Accuracy | 87.2 | # 6 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Movie Recommendation) | Bloomberg GPT (few-shot, k=3) | Accuracy | 90.4 | # 5 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Movie Recommendation) | BLOOM 176B (few-shot, k=3) | Accuracy | 91.2 | # 3 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Movie Recommendation) | GPT-NeoX (few-shot, k=3) | Accuracy | 86.4 | # 7 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Navigate) | OPT 66B (few-shot, k=3) | Accuracy | 42 | # 8 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Navigate) | GPT-NeoX (few-shot, k=3) | Accuracy | 45.2 | # 7 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Navigate) | Bloomberg GPT (few-shot, k=3) | Accuracy | 42 | # 8 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Navigate) | BLOOM 176B (few-shot, k=3) | Accuracy | 50 | # 6 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Navigate) | PaLM 540B (few-shot, k=3) | Accuracy | 62.4 | # 3 |
| Logical Reasoning | BIG-bench (Penguins In A Table) | PaLM 540B (few-shot, k=3) | Accuracy | 44.5 | # 4 |
| Logical Reasoning | BIG-bench (Penguins In A Table) | Bloomberg GPT (few-shot, k=3) | Accuracy | 37.67 | # 7 |
| Logical Reasoning | BIG-bench (Penguins In A Table) | GPT-NeoX (few-shot, k=3) | Accuracy | 33.56 | # 8 |
| Logical Reasoning | BIG-bench (Penguins In A Table) | OPT 66B (few-shot, k=3) | Accuracy | 28.08 | # 9 |
| Logical Reasoning | BIG-bench (Penguins In A Table) | BLOOM 176B (few-shot, k=3) | Accuracy | 40.41 | # 6 |
| Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | Bloomberg GPT (few-shot, k=3) | Accuracy | 34.8 | # 7 |
| Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | GPT-NeoX (few-shot, k=3) | Accuracy | 26 | # 9 |
| Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | OPT 66B (few-shot, k=3) | Accuracy | 31.2 | # 8 |
| Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | # 6 |
| Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | PaLM 540B (few-shot, k=3) | Accuracy | 38 | # 5 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Ruin Names) | Bloomberg GPT (few-shot, k=3) | Accuracy | 56 | # 4 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Ruin Names) | GPT-NeoX (few-shot, k=3) | Accuracy | 54 | # 6 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Ruin Names) | OPT 66B (few-shot, k=3) | Accuracy | 52.8 | # 7 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Ruin Names) | BLOOM 176B (few-shot, k=3) | Accuracy | 54.8 | # 5 |
| Multiple Choice Question Answering (MCQA) | BIG-bench (Ruin Names) | PaLM 540B (few-shot, k=3) | Accuracy | 76 | # 3 |
| Sarcasm Detection | BIG-bench (SNARKS) | BLOOM 176B (few-shot, k=3) | Accuracy | 72.47 | # 4 |
| Sarcasm Detection | BIG-bench (SNARKS) | PaLM 540B (few-shot, k=3) | Accuracy | 78.1 | # 3 |
| Sarcasm Detection | BIG-bench (SNARKS) | GPT-NeoX (few-shot, k=3) | Accuracy | 62.36 | # 6 |
| Sarcasm Detection | BIG-bench (SNARKS) | Bloomberg GPT (few-shot, k=3) | Accuracy | 69.66 | # 5 |
| Common Sense Reasoning | BIG-bench (Sports Understanding) | Bloomberg GPT (few-shot, k=3) | Accuracy | 62.8 | # 5 |
| Common Sense Reasoning | BIG-bench (Sports Understanding) | PaLM 540B (few-shot, k=3) | Accuracy | 80.4 | # 3 |
| Common Sense Reasoning | BIG-bench (Sports Understanding) | OPT 66B (few-shot, k=3) | Accuracy | 54.4 | # 7 |
| Common Sense Reasoning | BIG-bench (Sports Understanding) | GPT-NeoX (few-shot, k=3) | Accuracy | 53.2 | # 8 |
| Logical Reasoning | BIG-bench (Temporal Sequences) | BLOOM 176B (few-shot, k=3) | Accuracy | 36.8 | # 4 |
| Logical Reasoning | BIG-bench (Temporal Sequences) | PaLM 540B (few-shot, k=3) | Accuracy | 39.6 | # 3 |
| Logical Reasoning | BIG-bench (Temporal Sequences) | Bloomberg GPT (few-shot, k=3) | Accuracy | 29.2 | # 6 |
| Logical Reasoning | BIG-bench (Temporal Sequences) | GPT-NeoX (few-shot, k=3) | Accuracy | 21.2 | # 8 |
| Logical Reasoning | BIG-bench (Temporal Sequences) | OPT 66B (few-shot, k=3) | Accuracy | 23.6 | # 7 |
| Question Answering | BoolQ | OPT 66B (1-shot) | Accuracy | 57.5 | # 54 |
| Question Answering | BoolQ | BLOOM 176B (1-shot) | Accuracy | 52.9 | # 57 |
| Question Answering | BoolQ | GPT-NeoX 20B (1-shot) | Accuracy | 46.4 | # 59 |
| Question Answering | BoolQ | Bloomberg GPT 50B (1-shot) | Accuracy | 74.6 | # 34 |
| Natural Language Inference | CommitmentBank | GPT-NeoX (one-shot) | Accuracy | 48.21 | # 17 |
| Natural Language Inference | CommitmentBank | Bloomberg GPT (one-shot) | Accuracy | 53.57 | # 16 |
| Natural Language Inference | CommitmentBank | OPT 66B (one-shot) | Accuracy | 44.64 | # 19 |
| Natural Language Inference | CommitmentBank | BLOOM 176B (one-shot) | Accuracy | 48.21 | # 17 |
| Common Sense Reasoning | CommonsenseQA | OPT 66B (1-shot) | Accuracy | 66.4 | # 20 |
| Common Sense Reasoning | CommonsenseQA | Bloomberg GPT 50B (1-shot) | Accuracy | 65.5 | # 21 |
| Common Sense Reasoning | CommonsenseQA | GPT-NeoX 20B (1-shot) | Accuracy | 60.4 | # 27 |
| Common Sense Reasoning | CommonsenseQA | BLOOM 176B (1-shot) | Accuracy | 64.2 | # 23 |
| Question Answering | COPA | OPT 66B (one-shot) | Accuracy | 86 | # 25 |
| Question Answering | COPA | BLOOM 176B (one-shot) | Accuracy | 84 | # 31 |
| Question Answering | COPA | GPT-NeoX (one-shot) | Accuracy | 88 | # 21 |
| Question Answering | COPA | Bloomberg GPT (one-shot) | Accuracy | 86 | # 25 |
| Sentence Completion | HellaSwag | OPT 66B (1-shot) | Accuracy | 73.5 | # 49 |
| Sentence Completion | HellaSwag | BLOOM 176B (1-shot) | Accuracy | 73.2 | # 50 |
| Sentence Completion | HellaSwag | BloombergGPT 50B (1-shot) | Accuracy | 73.9 | # 48 |
| Sentence Completion | HellaSwag | GPT-NeoX 20B (1-shot) | Accuracy | 68.4 | # 52 |
| Multi-task Language Understanding | MMLU | BLOOM 176B (5-shot) | Average (%) | 39.1 | # 81 |
| Multi-task Language Understanding | MMLU | Bloomberg GPT 50B (5-shot) | Average (%) | 39.2 | # 79 |
| Multi-task Language Understanding | MMLU | OPT 66B (5-shot) | Average (%) | 36 | # 84 |
| Question Answering | MultiRC | BLOOM 176B (1-shot) | F1 | 26.7 | # 23 |
| Question Answering | MultiRC | Bloomberg GPT 50B (1-shot) | F1 | 62.3 | # 18 |
| Question Answering | MultiRC | GPT-NeoX 20B (1-shot) | F1 | 22.9 | # 24 |
| Question Answering | MultiRC | OPT 66B (1-shot) | F1 | 18.8 | # 25 |
| Question Answering | OpenBookQA | Bloomberg GPT 50B (1-shot) | Accuracy | 51.6 | # 32 |
| Question Answering | OpenBookQA | GPT-NeoX 50B (2-shot) | Accuracy | 44.2 | # 34 |
| Question Answering | OpenBookQA | OPT 66B (one-shot) | Accuracy | 58.0 | # 27 |
| Question Answering | OpenBookQA | BLOOM 176B (2-shot) | Accuracy | 47.2 | # 33 |
| Question Answering | PIQA | OPT 66B (1-shot) | Accuracy | 77.6 | # 35 |
| Question Answering | PIQA | GPT-NeoX 20B (1-shot) | Accuracy | 75.8 | # 42 |
| Question Answering | PIQA | Bloomberg GPT 50B (1-shot) | Accuracy | 77.9 | # 34 |
| Question Answering | PIQA | BLOOM 176B (1-shot) | Accuracy | 77 | # 37 |
| Reading Comprehension | RACE | OPT 66B (one-shot) | Accuracy (High) | 37.02 | # 17 |
| Reading Comprehension | RACE | OPT 66B (one-shot) | Accuracy (Middle) | 47.42 | # 17 |
| Reading Comprehension | RACE | GPT-NeoX (one-shot) | Accuracy (High) | 34.33 | # 18 |
| Reading Comprehension | RACE | GPT-NeoX (one-shot) | Accuracy (Middle) | 41.23 | # 18 |
| Reading Comprehension | RACE | Bloomberg GPT (one-shot) | Accuracy (High) | 41.74 | # 15 |
| Reading Comprehension | RACE | Bloomberg GPT (one-shot) | Accuracy (Middle) | 54.32 | # 15 |
| Reading Comprehension | RACE | BLOOM 176B (one-shot) | Accuracy (High) | 39.14 | # 16 |
| Reading Comprehension | RACE | BLOOM 176B (one-shot) | Accuracy (Middle) | 52.3 | # 16 |
| Common Sense Reasoning | ReCoRD | Bloomberg GPT 50B (1-shot) | F1 | 82.8 | # 17 |
| Common Sense Reasoning | ReCoRD | BLOOM 176B (1-shot) | F1 | 78 | # 23 |
| Common Sense Reasoning | ReCoRD | OPT 66B (1-shot) | F1 | 82.5 | # 21 |
| Common Sense Reasoning | ReCoRD | GPT-NeoX 20B (1-shot) | F1 | 67.9 | # 28 |
| Natural Language Inference | RTE | GPT-NeoX 20B (1-shot) | Accuracy | 53.8% | # 86 |
| Natural Language Inference | RTE | OPT 66B (1-shot) | Accuracy | 54.9% | # 83 |
| Natural Language Inference | RTE | BLOOM 176B (1-shot) | Accuracy | 57.4% | # 80 |
| Natural Language Inference | RTE | Bloomberg GPT 50B (1-shot) | Accuracy | 69.3% | # 56 |
| Common Sense Reasoning | WinoGrande | Bloomberg GPT (one-shot) | Accuracy | 64.1 | # 42 |
| Common Sense Reasoning | WinoGrande | BLOOM 176B (1-shot) | Accuracy | 67 | # 38 |
| Common Sense Reasoning | WinoGrande | OPT 66B (1-shot) | Accuracy | 66.1 | # 40 |
| Common Sense Reasoning | WinoGrande | GPT-NeoX (one-shot) | Accuracy | 60.6 | # 46 |