EAMA: Entity-Aware Multimodal Alignment Based Approach for News Image Captioning

Junzhe Zhang; Huixuan Zhang; Xunjian Yin; Xiaojun Wan

TL;DR

EAMA targets the challenge of producing entity-rich captions for news images by aligning Multimodal Large Language Models (MLLMs) with two entity-focused tasks, Entity-Aware Sentence Selection and Entity Selection, in addition to the standard News Image Captioning (NIC) objective. The aligned model then supplements the input article context with related sentences and entities that it extracts itself, which guides caption generation without introducing extra retrieval modules. Empirical results on GoodNews and NYTimes800k show that EAMA achieves state-of-the-art CIDEr scores and higher named-entity recall than strong baselines, including InstructBLIP finetuned in the official (OSFT) setting, while maintaining competitive entity generation. The approach demonstrates that carefully designed alignment tasks combined with concise, entity-relevant textual augmentation can substantially improve entity-rich NIC in practical settings.

Abstract

News image captioning requires a model to generate an informative, entity-rich caption given a news image and its associated news article. Current MLLMs remain limited in how they handle entity information in news image captioning tasks. In addition, generating high-quality news image captions requires a trade-off between the sufficiency and the conciseness of the textual input. To explore the potential of MLLMs and address the problems we discovered, we propose EAMA: an Entity-Aware Multimodal Alignment based approach for News Image Captioning. Our approach first aligns the MLLM with two extra alignment tasks, the Entity-Aware Sentence Selection task and the Entity Selection task, together with the News Image Captioning task. The aligned MLLM then uses the entity-related information it extracts itself to supplement the textual input while generating news image captions. Our approach achieves better results than all previous models on two mainstream news image captioning datasets.
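
The generation step described in the abstract can be sketched as a three-call pipeline: select related sentences, select related entities, then caption with both as a supplement. This is only an illustrative sketch, not the authors' implementation. The `MLLM` callable type, the `split_lines` helper, the `self_supplemented_caption` function, and all prompt strings are hypothetical placeholders; the prompts actually used are the ones shown in Figure 4 of the paper.

```python
from typing import Callable, List

# Hypothetical interface (an assumption): any function mapping (prompt, image) to text.
# `image` is an opaque handle; a real MLLM wrapper would take pixels or a file path.
MLLM = Callable[[str, object], str]


def split_lines(text: str) -> List[str]:
    """Split a model response into non-empty, stripped lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def self_supplemented_caption(model: MLLM, image: object, article: str) -> str:
    """Caption an image using sentences and entities extracted by the same model."""
    # Step 1: ask the aligned model which article sentences relate to the image.
    # The prompt wording is a placeholder, not the paper's prompt.
    sentences = split_lines(
        model(f"Select the sentences related to the image.\nArticle: {article}", image)
    )

    # Step 2: ask the same model which entities relate to the image.
    entities = split_lines(
        model(f"Select the entities related to the image.\nArticle: {article}", image)
    )

    # Step 3: generate the caption with the extracted information as a supplement.
    prompt = (
        "Write a news image caption.\n"
        f"Article: {article}\n"
        f"Related sentences: {' '.join(sentences)}\n"
        f"Related entities: {', '.join(entities)}"
    )
    return model(prompt, image).strip()
```

Because one model serves all three calls, no separate retrieval module is involved, which matches the self-supplementing design described above.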

Paper Structure (25 sections, 6 equations, 6 figures, 6 tables)

Introduction
Related Works
News Image Captioning
MLLMs and Alignment
Approach
Overview
Entity-Aware Multimodal Alignment
Self-Supplemented Generation
Experiment Setup
Datasets & Metrics
Implementation Details
Experimental Results
Performance of MLLMs on NIC
Comparison Results of Different Methods
Ablation Study
...and 10 more sections

Figures (6)

Figure 1: An example of the news image captioning task and the responses of the MLLM in different settings. Origin denotes Zero_Shot, OSFT denotes finetuning the MLLM in the official settings, and EAMA denotes our approach.
Figure 2: An overview of our approach: EAMA. The left part represents Entity-Aware Multimodal Alignment with Entity-Aware Sentence Selection, Entity Selection and News Image Captioning task. The right part represents the Self-Supplemented Generation for the news image caption given the news article and the news image. The aligned MLLM first extracts related sentences and entities and then generates the news image caption with extracted information as a supplement.
Figure 3: Performance of MLLMs in the Zero_Shot and officially supported Supervised Finetune (OSFT) settings. Following the OSFT settings, we train the V-L Connector in MiniGPT-v2 and InstructBLIP, and fully train the V-L Connector together with the LLM in LLaVA-v1.5. LLaVA-v1.5, MiniGPT-v2 and InstructBLIP are implemented with 7B checkpoints.
Figure 4: An example of our prompts for Entity-Aware Sentence Selection, Entity Selection and News Image Captioning tasks. We label entities in the caption in green. We label entities in the input context but not in the caption in blue.
Figure 5: An example of news image caption generation. We present the news image, news article and news image captions for one sample from the NYTimes800k dataset. The captions are generated, respectively, by: (1) InstructBLIP in the zero-shot setting. (2) LLaVA-v1.5 in the zero-shot setting. (3) MiniGPT-v2 in the zero-shot setting. (4) GPT-4V in the zero-shot setting. (5) InstructBLIP (OSFT) on NYTimes800k. (6) LLaVA-v1.5 (OSFT) on NYTimes800k. (7) MiniGPT-v2 (OSFT) on NYTimes800k. (8) Our method, EAMA, on NYTimes800k. (9) is the ground-truth caption. We highlight entities that occur in the reference caption in red.
...and 1 more figures
