MMT: Image-guided Story Ending Generation with Multimodal Memory Transformer

As a specific form of story generation, Image-guided Story Ending Generation (IgSEG) is a recently proposed task of generating a story ending for a given multi-sentence story plot and an ending-related image. Unlike existing image captioning or story ending generation tasks, IgSEG aims to generate a factual description that conforms to both the contextual logic and the relevant visual concepts. Existing methods for IgSEG ignore the relationships between the modalities and do not integrate multimodal features appropriately. In this work, we therefore propose the Multimodal Memory Transformer (MMT), an end-to-end framework that models and fuses contextual and visual information to effectively capture the multimodal dependencies in IgSEG. First, we extract textual and visual features separately with modality-specific large-scale pretrained encoders. Second, we use a memory-augmented cross-modal attention network to learn cross-modal relationships and perform fine-grained feature fusion. Finally, a multimodal transformer decoder builds attention over the fused multimodal features to learn the story dependencies and to generate informative, reasonable, and coherent story endings. Extensive automatic and human evaluations show that MMT significantly outperforms state-of-the-art methods on two benchmark datasets.
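
As a rough illustration of the fusion step, below is a minimal PyTorch sketch of memory-augmented cross-modal attention: text-token queries attend over image-region keys and values that are extended with learnable memory slots, and the fused features would then feed a transformer decoder. The module name, dimensions, and the specific memory design are our own assumptions for illustration, not the authors' released implementation.

import torch
import torch.nn as nn


class MemoryAugmentedCrossAttention(nn.Module):
    """Sketch: cross-attention from text tokens to visual regions, with extra
    learnable memory slots appended to the visual keys/values (assumed design)."""

    def __init__(self, d_model=512, n_heads=8, n_memory_slots=16):
        super().__init__()
        # Learnable memory vectors shared across examples.
        self.memory = nn.Parameter(torch.randn(n_memory_slots, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, text_feats, visual_feats):
        # text_feats:   (B, L_t, d) token features from the text encoder
        # visual_feats: (B, L_v, d) region/patch features from the image encoder
        b = text_feats.size(0)
        mem = self.memory.unsqueeze(0).expand(b, -1, -1)
        kv = torch.cat([visual_feats, mem], dim=1)   # memory-augmented keys/values
        fused, _ = self.attn(query=text_feats, key=kv, value=kv)
        x = self.norm1(text_feats + fused)           # residual fusion
        x = self.norm2(x + self.ffn(x))
        return x                                     # (B, L_t, d) fused features


# Example usage with random features (batch of 2, 20 text tokens, 36 image regions).
# The fused output would then be passed to a multimodal transformer decoder
# (e.g. nn.TransformerDecoder) to generate the story ending token by token.
fusion = MemoryAugmentedCrossAttention()
out = fusion(torch.randn(2, 20, 512), torch.randn(2, 36, 512))
print(out.shape)  # torch.Size([2, 20, 512])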


Datasets


Introduced in the Paper: LSMDC-E
Used in the Paper: VIST-E
Task: Image-guided Story Ending Generation   Model: MMT   (Global Rank: #1 on every metric)

Dataset   BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  CIDEr  ROUGE-L
LSMDC-E    18.52    5.99    2.51    1.13   12.87   12.41    20.99
VIST-E     22.87    8.68    4.38    2.61   15.55   25.41    23.61
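
For context on the metrics above, generated endings are typically scored against reference endings with standard captioning metrics. The snippet below is a minimal sketch assuming the pycocoevalcap package and pre-tokenized, space-separated strings; it is illustrative only and not the paper's evaluation script (the example sentences are made up).

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Keys are example ids; values are lists of tokenized, space-separated strings.
refs = {0: ["she blew out the candles and everyone cheered ."]}   # reference endings
hyps = {0: ["she blew out the candles and smiled ."]}             # generated endings

bleu, _ = Bleu(4).compute_score(refs, hyps)    # [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
cider, _ = Cider().compute_score(refs, hyps)   # CIDEr
rouge, _ = Rouge().compute_score(refs, hyps)   # ROUGE-L
# METEOR (pycocoevalcap.meteor) can be computed the same way but requires a Java runtime.
print(bleu, cider, rouge)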
