EAMA: Entity-Aware Multimodal Alignment Based Approach for News Image Captioning
Junzhe Zhang, Huixuan Zhang, Xunjian Yin, Xiaojun Wan
TL;DR
EAMA targets the challenge of producing entity-rich captions for news images by aligning Multimodal Large Language Models (MLLMs) with two entity-focused tasks, Entity-Aware Sentence Selection and Entity Selection, in addition to the standard News Image Captioning (NIC) objective. The aligned model then self-supplements the input article context with the related sentences and entities it extracts, guiding caption generation without introducing extra retrieval modules. Empirical results on GoodNews and NYTimes800k show that EAMA achieves state-of-the-art CIDEr scores and higher named-entity recall than strong baselines, including OSFT-enhanced InstructBLIP, while maintaining competitive entity generation. The approach demonstrates that carefully designed alignment tasks, combined with concise, entity-relevant textual augmentation, can substantially improve entity-rich NIC in practical settings.
Abstract
News image captioning requires a model to generate an informative, entity-rich caption given a news image and its associated news article. Current multimodal large language models (MLLMs) still have limitations in handling entity information in news image captioning tasks. In addition, generating high-quality news image captions requires a trade-off between the sufficiency and the conciseness of the textual input. To explore the potential of MLLMs and address these problems, we propose EAMA: an Entity-Aware Multimodal Alignment based approach for News Image Captioning. Our approach first aligns the MLLM with two extra alignment tasks, the Entity-Aware Sentence Selection task and the Entity Selection task, together with the News Image Captioning task. The aligned MLLM then uses the additional entity-related information it extracts itself to supplement the textual input when generating news image captions. Our approach achieves better results than all previous models on two mainstream news image captioning datasets.
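The self-supplemented generation step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `generate` callable, the helper names, and all prompt wording are assumptions introduced here. It only shows the control flow in which the same aligned model first selects related sentences and entities, and then captions the image using the article plus that extracted information.

```python
from typing import Callable, List

# Hypothetical interface: any callable mapping (image, prompt) -> generated text.
# In practice this would wrap the aligned MLLM; it is an assumption of this sketch.
GenerateFn = Callable[[object, str], str]


def split_lines(text: str) -> List[str]:
    """Split model output into non-empty, stripped lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_caption_prompt(article: str, sentences: List[str], entities: List[str]) -> str:
    """Supplement the article with self-extracted sentences and entities."""
    parts = [f"Article: {article}"]
    if sentences:
        parts.append("Related sentences: " + " ".join(sentences))
    if entities:
        parts.append("Related entities: " + ", ".join(entities))
    parts.append("Write a caption for the news image.")
    return "\n".join(parts)


def caption_with_self_supplement(generate: GenerateFn, image: object, article: str) -> str:
    """Run sentence selection, entity selection, then captioning with one model."""
    # Step 1: entity-aware sentence selection (illustrative prompt wording).
    sentences = split_lines(
        generate(image, f"Article: {article}\nSelect the sentences related to the image, one per line.")
    )
    # Step 2: entity selection (illustrative prompt wording).
    entities = split_lines(
        generate(image, f"Article: {article}\nList the entities related to the image, one per line.")
    )
    # Step 3: caption generation with the supplemented textual input.
    return generate(image, build_caption_prompt(article, sentences, entities))
```

Because all three steps call the same model, no separate retrieval module is involved; how much extracted text to include is the sufficiency-versus-conciseness trade-off the abstract mentions.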
