MMT: Image-guided Story Ending Generation with Multimodal Memory Transformer

As a specific form of story generation, Image-guided Story Ending Generation (IgSEG) is a recently proposed task of generating a story ending for a given multi-sentence story plot and an ending-related image. Unlike existing image captioning or story ending generation tasks, IgSEG aims to generate a factual description that conforms to both the contextual logic and the relevant visual concepts. Existing methods for IgSEG ignore the relationships between the modalities and do not integrate multimodal features appropriately. In this work, we therefore propose the Multimodal Memory Transformer (MMT), an end-to-end framework that models and fuses contextual and visual information to effectively capture the multimodal dependencies in IgSEG. First, we extract textual and visual features separately with modality-specific large-scale pretrained encoders. Second, we use a memory-augmented cross-modal attention network to learn cross-modal relationships and perform fine-grained feature fusion. Finally, a multimodal transformer decoder builds attention over the fused multimodal features to learn the story dependencies and to generate informative, reasonable, and coherent story endings. Extensive automatic and human evaluations show that MMT significantly outperforms state-of-the-art methods on two benchmark datasets.
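
As a rough illustration of the fusion step, below is a minimal PyTorch sketch of memory-augmented cross-modal attention: text-token queries attend over image-region keys and values that are extended with learnable memory slots, and the fused features would then feed a transformer decoder. The module name, dimensions, and the specific memory design are our own assumptions for illustration, not the authors' released implementation.

import torch
import torch.nn as nn


class MemoryAugmentedCrossAttention(nn.Module):
    """Sketch: cross-attention from text tokens to visual regions, with extra
    learnable memory slots appended to the visual keys/values (assumed design)."""

    def __init__(self, d_model=512, n_heads=8, n_memory_slots=16):
        super().__init__()
        # Learnable memory vectors shared across examples.
        self.memory = nn.Parameter(torch.randn(n_memory_slots, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, text_feats, visual_feats):
        # text_feats:   (B, L_t, d) token features from the text encoder
        # visual_feats: (B, L_v, d) region/patch features from the image encoder
        b = text_feats.size(0)
        mem = self.memory.unsqueeze(0).expand(b, -1, -1)
        kv = torch.cat([visual_feats, mem], dim=1)   # memory-augmented keys/values
        fused, _ = self.attn(query=text_feats, key=kv, value=kv)
        x = self.norm1(text_feats + fused)           # residual fusion
        x = self.norm2(x + self.ffn(x))
        return x                                     # (B, L_t, d) fused features


# Example usage with random features (batch of 2, 20 text tokens, 36 image regions).
# The fused output would then be passed to a multimodal transformer decoder
# (e.g. nn.TransformerDecoder) to generate the story ending token by token.
fusion = MemoryAugmentedCrossAttention()
out = fusion(torch.randn(2, 20, 512), torch.randn(2, 36, 512))
print(out.shape)  # torch.Size([2, 20, 512])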


Datasets


Introduced in the Paper: LSMDC-E
Used in the Paper: VIST-E
Task: Image-guided Story Ending Generation   Model: MMT   (Global Rank: #1 on every metric)

Dataset   BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  CIDEr  ROUGE-L
LSMDC-E    18.52    5.99    2.51    1.13   12.87   12.41    20.99
VIST-E     22.87    8.68    4.38    2.61   15.55   25.41    23.61
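
For context on the metrics above, generated endings are typically scored against reference endings with standard captioning metrics. The snippet below is a minimal sketch assuming the pycocoevalcap package and pre-tokenized, space-separated strings; it is illustrative only and not the paper's evaluation script (the example sentences are made up).

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Keys are example ids; values are lists of tokenized, space-separated strings.
refs = {0: ["she blew out the candles and everyone cheered ."]}   # reference endings
hyps = {0: ["she blew out the candles and smiled ."]}             # generated endings

bleu, _ = Bleu(4).compute_score(refs, hyps)    # [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
cider, _ = Cider().compute_score(refs, hyps)   # CIDEr
rouge, _ = Rouge().compute_score(refs, hyps)   # ROUGE-L
# METEOR (pycocoevalcap.meteor) can be computed the same way but requires a Java runtime.
print(bleu, cider, rouge)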
