Mask Attention Networks: Rethinking and Strengthen Transformer

The Transformer is an attention-based neural network whose blocks consist of two sublayers, the Self-Attention Network (SAN) and the Feed-Forward Network (FFN). Existing research has explored enhancing the two sublayers separately to improve the Transformer's capability for text representation. In this paper, we present a novel understanding of SAN and FFN as Mask Attention Networks (MANs) and show that they are two special cases of MANs with static mask matrices. However, these static mask matrices limit the capability for localness modeling in text representation learning. We therefore introduce a new layer, the Dynamic Mask Attention Network (DMAN), with a learnable mask matrix that can model localness adaptively. To incorporate the advantages of DMAN, SAN, and FFN, we propose a sequential layered structure that combines the three types of layers. Extensive experiments on various tasks, including neural machine translation and text summarization, demonstrate that our model outperforms the original Transformer.
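
As a rough illustration of the MAN view described above, the sketch below implements masked attention for a single head without input/output projections: an all-ones mask recovers SAN-style attention, an identity mask recovers the FFN-like case where each token attends only to itself, and an input-dependent mask plays the role of DMAN. The dynamic gate used here (a sigmoid over a linear map of the query plus a relative-distance penalty) is a hypothetical stand-in for localness modeling, not the paper's exact parameterization.

```python
import math
import torch
import torch.nn.functional as F

def mask_attention(q, k, v, mask):
    """Generic Mask Attention Network (MAN) layer: scaled dot-product
    attention whose weights are element-wise gated by a mask matrix
    and then re-normalized."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)      # (T, T) attention logits
    weights = F.softmax(scores, dim=-1) * mask           # gate weights with mask M
    weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    return weights @ v

T, d = 6, 8                      # toy sequence length and model width
x = torch.randn(T, d)

# SAN as a MAN with a static all-ones mask: every token attends to every token.
san_out = mask_attention(x, x, x, torch.ones(T, T))

# FFN-like case as a MAN with a static identity mask: each token attends only to itself.
ffn_out = mask_attention(x, x, x, torch.eye(T))

# DMAN: the mask is predicted from the input instead of being fixed.
# This gating network is a hypothetical illustration of a learnable,
# locality-biased mask, not the paper's exact formulation.
w_g = torch.randn(d, 1) * 0.1
dist = (torch.arange(T)[:, None] - torch.arange(T)[None, :]).abs().float()
dyn_mask = torch.sigmoid(x @ w_g - 0.5 * dist)           # favors nearby tokens
dman_out = mask_attention(x, x, x, dyn_mask)
```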

NAACL 2021
Benchmark results (Task, Dataset, Model; metric value and global rank)

Abstractive Text Summarization, CNN / Daily Mail, Mask Attention Network:
  ROUGE-1: 40.98 (rank #36)
  ROUGE-2: 18.29 (rank #36)
  ROUGE-L: 37.88 (rank #35)

Text Summarization, GigaWord, Mask Attention Network:
  ROUGE-1: 38.28 (rank #19)
  ROUGE-2: 19.46 (rank #19)
  ROUGE-L: 35.46 (rank #19)

Machine Translation, IWSLT2014 German-English, Mask Attention Network (small):
  BLEU score: 36.3 (rank #14)
  Number of params: 37M (rank #2)

Machine Translation, WMT2014 English-German, Mask Attention Network (big):
  BLEU score: 30.4 (rank #11)
  Number of params: 215M (rank #5)

Machine Translation, WMT2014 English-German, Mask Attention Network (base):
  BLEU score: 29.1 (rank #33)
  Number of params: 63M (rank #12)
