When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute

EMNLP 2021 · Tao Lei

Large language models have become increasingly difficult to train because of the growing computation time and cost. In this work, we present SRU++, a highly efficient architecture that combines fast recurrence and attention for sequence modeling. SRU++ exhibits strong modeling capacity and training efficiency. On standard language modeling benchmarks such as the Enwik8, WikiText-103, and Billion Word datasets, our model obtains better bits-per-character and perplexity while requiring 3x-10x less training cost compared to top-performing Transformer models. For instance, our model achieves a state-of-the-art result on the Enwik8 dataset using 1.6 days of training on an 8-GPU machine. We further demonstrate that SRU++ requires minimal attention for near state-of-the-art performance. Our results suggest jointly leveraging fast recurrence with little attention as a promising direction for accelerating model training and inference.
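To make the core idea concrete, the sketch below shows one way to pair a lightweight attention sub-layer with an SRU-style element-wise recurrence, in the spirit of the architecture described above. This is an illustrative PyTorch sketch, not the authors' implementation (the official code is in the `asappresearch/sru` repository); the class names, the single-head attention, and the dimensions are assumptions made for clarity.

```python
# Illustrative sketch only: an attention sub-layer whose output feeds an
# SRU-style element-wise recurrence, so the per-step loop contains no matrix
# multiplications. Names and hyperparameters are hypothetical.
import torch
import torch.nn as nn


class SimpleRecurrence(nn.Module):
    """SRU-style recurrence: the projection is done once, outside the time loop."""

    def __init__(self, d_model):
        super().__init__()
        # One fused projection produces the candidate and two gate pre-activations.
        self.proj = nn.Linear(d_model, 3 * d_model)

    def forward(self, x):                      # x: (seq_len, batch, d_model)
        cand, f_pre, r_pre = self.proj(x).chunk(3, dim=-1)
        f = torch.sigmoid(f_pre)               # forget gate
        r = torch.sigmoid(r_pre)               # reset / highway gate
        c = torch.zeros_like(cand[0])
        outputs = []
        for t in range(x.size(0)):             # only element-wise ops per step
            c = f[t] * c + (1.0 - f[t]) * cand[t]
            outputs.append(r[t] * c + (1.0 - r[t]) * x[t])
        return torch.stack(outputs)


class AttentiveRecurrentLayer(nn.Module):
    """Attention used as the input transformation, followed by fast recurrence."""

    def __init__(self, d_model, n_heads=1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.recurrence = SimpleRecurrence(d_model)

    def forward(self, x, attn_mask=None):
        h = self.norm(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask)
        return self.recurrence(x + h)          # residual into the recurrence


if __name__ == "__main__":
    layer = AttentiveRecurrentLayer(d_model=64)
    tokens = torch.randn(10, 2, 64)            # (seq_len, batch, d_model)
    print(layer(tokens).shape)                 # torch.Size([10, 2, 64])
```

Because the recurrence itself involves only element-wise operations, most of the compute sits in the batched projections and the attention, which is consistent with the paper's claim that only a small amount of attention is needed for near state-of-the-art results.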


Results from the Paper


Task                Dataset           Model        Metric                    Value   Global Rank
Language Modelling  enwik8            SRU++ Large  Bit per Character (BPC)   0.95    #4
Language Modelling  enwik8            SRU++ Large  Number of params          195M    #9
Language Modelling  enwik8            SRU++ Base   Bit per Character (BPC)   0.97    #8
Language Modelling  enwik8            SRU++ Base   Number of params          108M    #12
Language Modelling  One Billion Word  SRU++ Large  PPL                       23.5    #6
Language Modelling  One Billion Word  SRU++ Large  Number of params          465M    #25
Language Modelling  One Billion Word  SRU++        PPL                       25.1    #12
Language Modelling  One Billion Word  SRU++        Number of params          328M    #24
Language Modelling  WikiText-103      SRU++ Base   Validation perplexity     17.5    #13
Language Modelling  WikiText-103      SRU++ Base   Test perplexity           18.3    #33
Language Modelling  WikiText-103      SRU++ Base   Number of params          148M    #31
Language Modelling  WikiText-103      SRU++ Large  Validation perplexity     16.4    #8
Language Modelling  WikiText-103      SRU++ Large  Test perplexity           17.1    #20
Language Modelling  WikiText-103      SRU++ Large  Number of params          234M    #27

Methods