LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

30 May 2023 · Yi Tu, Ya Guo, Huan Chen, Jinyang Tang

Visually-rich Document Understanding (VrDU) has attracted much research attention in recent years. Models with transformer-based backbones, pre-trained on a large number of document images, have led to significant performance gains in this field. The major challenge is how to fuse the different modalities (text, layout, and image) of the documents in a unified model with different pre-training tasks. This paper focuses on improving text-layout interactions and proposes a novel multi-modal pre-training model, LayoutMask. LayoutMask uses local 1D position, instead of global 1D position, as layout input and has two pre-training objectives: (1) Masked Language Modeling: predicting masked tokens with two novel masking strategies; (2) Masked Position Modeling: predicting masked 2D positions to improve layout representation learning. LayoutMask can enhance the interactions between text and layout modalities in a unified model and produce adaptive and robust multi-modal representations for downstream tasks. Experimental results show that the proposed method achieves state-of-the-art results on a wide variety of VrDU problems, including form understanding, receipt understanding, and document image classification.
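The difference between global and local 1D position can be illustrated with a minimal sketch. It assumes that a global 1D position numbers tokens consecutively across the whole document in reading order, while a local 1D position restarts at 1 inside each text segment (for example, each OCR line or block). The function names, the 1-based numbering, and the toy segments below are hypothetical and are not taken from the authors' implementation.

```python
from typing import List


def global_1d_positions(segments: List[List[str]]) -> List[int]:
    """Number tokens consecutively across all segments (assumed 1-based)."""
    total = sum(len(segment) for segment in segments)
    return list(range(1, total + 1))


def local_1d_positions(segments: List[List[str]]) -> List[int]:
    """Restart the numbering at 1 inside every segment (assumed 1-based)."""
    positions: List[int] = []
    for segment in segments:
        positions.extend(range(1, len(segment) + 1))
    return positions


# Toy document made of two segments; the tokens are illustrative only.
segments = [["Invoice", "No."], ["Total", "amount", "due"]]
print(global_1d_positions(segments))  # [1, 2, 3, 4, 5]
print(local_1d_positions(segments))   # [1, 2, 1, 2, 3]
```

Under this reading, local positions carry no information about the order of segments, so a model would have to infer cross-segment reading order from the 2D layout and the text, which is consistent with the abstract's stated goal of strengthening text-layout interaction.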

Tasks

Document Image Classification

document understanding

Image Classification

Key Information Extraction

Language Modelling

Masked Language Modeling

Named Entity Recognition (NER)

Position

Representation Learning

Semantic entity labeling

Datasets

FUNSD

RVL-CDIP

CORD

SROIE

CORD-r

FUNSD-r

Results from the Paper

Ranked #1 on Semantic entity labeling on FUNSD

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Key Information Extraction	CORD	LayoutMask (base)	F1	96.99	#4
Key Information Extraction	CORD	LayoutMask (large)	F1	97.19	#3
Named Entity Recognition (NER)	CORD-r	LayoutMask	F1	81.84	#4
Semantic entity labeling	FUNSD	LayoutMask (large)	F1	93.20	#1
Semantic entity labeling	FUNSD	LayoutMask (base)	F1	92.91	#3
Named Entity Recognition (NER)	FUNSD-r	LayoutMask	F1	77.10	#4
