Keyword Transformer: A Self-Attention Model for Keyword Spotting

1 Apr 2021 · Axel Berg, Mark O'Connor, Miguel Tairum Cruz

The Transformer architecture has been successful across many domains, including natural language processing, computer vision and speech recognition. In keyword spotting, self-attention has primarily been used on top of convolutional or recurrent encoders. We investigate a range of ways to adapt the Transformer architecture to keyword spotting and introduce the Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data. Surprisingly, this simple architecture outperforms more complex models that mix convolutional, recurrent and attentive layers. KWT can be used as a drop-in replacement for these models, setting two new benchmark records on the Google Speech Commands dataset with 98.6% and 97.7% accuracy on the 12- and 35-command tasks, respectively.
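KWT adapts the Vision Transformer recipe to audio: the MFCC spectrogram is split along the time axis, each frame is linearly projected to a token embedding, a learnable class token and positional embeddings are added, and a standard Transformer encoder processes the sequence, with the classification head reading the class-token output. The PyTorch sketch below illustrates this design; it is not the authors' implementation, and the class name, input shape (98 × 40 MFCC frames) and hyperparameters are illustrative assumptions chosen to resemble the smallest model.

```python
import torch
import torch.nn as nn

class KeywordTransformerSketch(nn.Module):
    """KWT-style model sketch: each MFCC time frame becomes one token,
    a learnable class token is prepended, and a Transformer encoder
    attends over the resulting sequence. Hyperparameters are illustrative."""

    def __init__(self, n_frames=98, n_mfcc=40, dim=64, depth=12,
                 heads=1, mlp_dim=256, num_classes=35):
        super().__init__()
        self.proj = nn.Linear(n_mfcc, dim)  # frame -> token embedding
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.pos_emb = nn.Parameter(torch.randn(1, n_frames + 1, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
            batch_first=True, norm_first=True)  # pre-norm, as in ViT
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        # x: (batch, n_frames, n_mfcc) MFCC features
        tokens = self.proj(x)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_emb
        out = self.encoder(tokens)
        return self.head(out[:, 0])  # classify from the class token

model = KeywordTransformerSketch()
logits = model(torch.randn(2, 98, 40))  # dummy batch of MFCC spectrograms
print(logits.shape)  # torch.Size([2, 35])
```

Per the paper, the three model sizes share the 12-layer depth and differ mainly in width: KWT-1, KWT-2 and KWT-3 use embedding dimensions of roughly 64, 128 and 192 with proportionally scaled head counts and MLP sizes. Set num_classes=12 for the V1/V2 12-command tasks.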

Results from the Paper


Task: Keyword Spotting · Dataset: Google Speech Commands · Metric: Accuracy (%)

Model   Benchmark         Accuracy (%)    Global Rank
KWT-3   V1, 12 commands   97.49 ± 0.15    #5
KWT-3   V2, 12 commands   98.56 ± 0.07    #2
KWT-3   V2, 35 commands   97.69 ± 0.09    #7
KWT-2   V1, 12 commands   97.27 ± 0.08    #8
KWT-2   V2, 12 commands   98.43 ± 0.08    #4
KWT-2   V2, 35 commands   97.74 ± 0.03    #6
KWT-1   V1, 12 commands   97.26 ± 0.18    #9
KWT-1   V2, 12 commands   98.08 ± 0.10    #7
KWT-1   V2, 35 commands   96.95 ± 0.14    #10