Keyword Transformer: A Self-Attention Model for Keyword Spotting

1 Apr 2021 · Axel Berg, Mark O'Connor, Miguel Tairum Cruz

The Transformer architecture has been successful across many domains, including natural language processing, computer vision and speech recognition. In keyword spotting, self-attention has primarily been used on top of convolutional or recurrent encoders. We investigate a range of ways to adapt the Transformer architecture to keyword spotting and introduce the Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data. Surprisingly, this simple architecture outperforms more complex models that mix convolutional, recurrent and attentive layers. KWT can be used as a drop-in replacement for these models, setting two new benchmark records on the Google Speech Commands dataset with 98.6% and 97.7% accuracy on the 12- and 35-command tasks, respectively.
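KWT adapts the Vision Transformer recipe to audio: the MFCC spectrogram is split along the time axis, each frame is linearly projected to a token embedding, a learnable class token and positional embeddings are added, and a standard Transformer encoder processes the sequence, with the classification head reading the class-token output. The PyTorch sketch below illustrates this design; it is not the authors' implementation, and the class name, input shape (98 × 40 MFCC frames) and hyperparameters are illustrative assumptions chosen to resemble the smallest model.

```python
import torch
import torch.nn as nn

class KeywordTransformerSketch(nn.Module):
    """KWT-style model sketch: each MFCC time frame becomes one token,
    a learnable class token is prepended, and a Transformer encoder
    attends over the resulting sequence. Hyperparameters are illustrative."""

    def __init__(self, n_frames=98, n_mfcc=40, dim=64, depth=12,
                 heads=1, mlp_dim=256, num_classes=35):
        super().__init__()
        self.proj = nn.Linear(n_mfcc, dim)  # frame -> token embedding
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.pos_emb = nn.Parameter(torch.randn(1, n_frames + 1, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
            batch_first=True, norm_first=True)  # pre-norm, as in ViT
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        # x: (batch, n_frames, n_mfcc) MFCC features
        tokens = self.proj(x)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_emb
        out = self.encoder(tokens)
        return self.head(out[:, 0])  # classify from the class token

model = KeywordTransformerSketch()
logits = model(torch.randn(2, 98, 40))  # dummy batch of MFCC spectrograms
print(logits.shape)  # torch.Size([2, 35])
```

Per the paper, the three model sizes share the 12-layer depth and differ mainly in width: KWT-1, KWT-2 and KWT-3 use embedding dimensions of roughly 64, 128 and 192 with proportionally scaled head counts and MLP sizes. Set num_classes=12 for the V1/V2 12-command tasks.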

Results from the Paper


Task: Keyword Spotting · Dataset: Google Speech Commands · Metric: Accuracy (%)

Model   Benchmark         Accuracy (%)    Global Rank
KWT-3   V1, 12 commands   97.49 ± 0.15    #5
KWT-3   V2, 12 commands   98.56 ± 0.07    #2
KWT-3   V2, 35 commands   97.69 ± 0.09    #7
KWT-2   V1, 12 commands   97.27 ± 0.08    #8
KWT-2   V2, 12 commands   98.43 ± 0.08    #4
KWT-2   V2, 35 commands   97.74 ± 0.03    #6
KWT-1   V1, 12 commands   97.26 ± 0.18    #9
KWT-1   V2, 12 commands   98.08 ± 0.10    #7
KWT-1   V2, 35 commands   96.95 ± 0.14    #10