When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute

EMNLP 2021 · Tao Lei

Large language models have become increasingly difficult to train because of the growing computation time and cost. In this work, we present SRU++, a highly efficient architecture that combines fast recurrence and attention for sequence modeling. SRU++ exhibits strong modeling capacity and training efficiency. On standard language modeling benchmarks such as the Enwik8, WikiText-103, and Billion Word datasets, our model obtains better bits-per-character and perplexity while requiring 3x-10x less training cost compared to top-performing Transformer models. For instance, our model achieves a state-of-the-art result on the Enwik8 dataset using 1.6 days of training on an 8-GPU machine. We further demonstrate that SRU++ requires minimal attention for near state-of-the-art performance. Our results suggest jointly leveraging fast recurrence with little attention as a promising direction for accelerating model training and inference.
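To make the core idea concrete, the sketch below shows one way to pair a lightweight attention sub-layer with an SRU-style element-wise recurrence, in the spirit of the architecture described above. This is an illustrative PyTorch sketch, not the authors' implementation (the official code is in the `asappresearch/sru` repository); the class names, the single-head attention, and the dimensions are assumptions made for clarity.

```python
# Illustrative sketch only: an attention sub-layer whose output feeds an
# SRU-style element-wise recurrence, so the per-step loop contains no matrix
# multiplications. Names and hyperparameters are hypothetical.
import torch
import torch.nn as nn


class SimpleRecurrence(nn.Module):
    """SRU-style recurrence: the projection is done once, outside the time loop."""

    def __init__(self, d_model):
        super().__init__()
        # One fused projection produces the candidate and two gate pre-activations.
        self.proj = nn.Linear(d_model, 3 * d_model)

    def forward(self, x):                      # x: (seq_len, batch, d_model)
        cand, f_pre, r_pre = self.proj(x).chunk(3, dim=-1)
        f = torch.sigmoid(f_pre)               # forget gate
        r = torch.sigmoid(r_pre)               # reset / highway gate
        c = torch.zeros_like(cand[0])
        outputs = []
        for t in range(x.size(0)):             # only element-wise ops per step
            c = f[t] * c + (1.0 - f[t]) * cand[t]
            outputs.append(r[t] * c + (1.0 - r[t]) * x[t])
        return torch.stack(outputs)


class AttentiveRecurrentLayer(nn.Module):
    """Attention used as the input transformation, followed by fast recurrence."""

    def __init__(self, d_model, n_heads=1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.recurrence = SimpleRecurrence(d_model)

    def forward(self, x, attn_mask=None):
        h = self.norm(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask)
        return self.recurrence(x + h)          # residual into the recurrence


if __name__ == "__main__":
    layer = AttentiveRecurrentLayer(d_model=64)
    tokens = torch.randn(10, 2, 64)            # (seq_len, batch, d_model)
    print(layer(tokens).shape)                 # torch.Size([10, 2, 64])
```

Because the recurrence itself involves only element-wise operations, most of the compute sits in the batched projections and the attention, which is consistent with the paper's claim that only a small amount of attention is needed for near state-of-the-art results.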


Results from the Paper


Task                Dataset           Model        Metric                    Value   Global Rank
Language Modelling  enwik8            SRU++ Large  Bit per Character (BPC)   0.95    #4
Language Modelling  enwik8            SRU++ Large  Number of params          195M    #9
Language Modelling  enwik8            SRU++ Base   Bit per Character (BPC)   0.97    #8
Language Modelling  enwik8            SRU++ Base   Number of params          108M    #12
Language Modelling  One Billion Word  SRU++ Large  PPL                       23.5    #6
Language Modelling  One Billion Word  SRU++ Large  Number of params          465M    #25
Language Modelling  One Billion Word  SRU++        PPL                       25.1    #12
Language Modelling  One Billion Word  SRU++        Number of params          328M    #24
Language Modelling  WikiText-103      SRU++ Base   Validation perplexity     17.5    #13
Language Modelling  WikiText-103      SRU++ Base   Test perplexity           18.3    #33
Language Modelling  WikiText-103      SRU++ Base   Number of params          148M    #31
Language Modelling  WikiText-103      SRU++ Large  Validation perplexity     16.4    #8
Language Modelling  WikiText-103      SRU++ Large  Test perplexity           17.1    #20
Language Modelling  WikiText-103      SRU++ Large  Number of params          234M    #27

Methods