MUNIChus: Multilingual News Image Captioning Benchmark

Yuji Chen; Alistair Plum; Hansi Hettiarachchi; Diptesh Kanojia; Saroj Basnet; Marcos Zampieri; Tharindu Ranasinghe

TL;DR

MUNIChus is the first multilingual news image captioning benchmark. It covers 9 languages, including several low-resource languages such as Sinhala and Urdu, and is used to evaluate various state-of-the-art neural news image captioning models.

Abstract

The goal of news image captioning is to generate captions by integrating news article content with corresponding images, highlighting the relationship between textual context and visual elements. The majority of research on news image captioning focuses on English, primarily because datasets in other languages are scarce. To address this limitation, we create the first multilingual news image captioning benchmark, MUNIChus, comprising 9 languages, including several low-resource languages such as Sinhala and Urdu. We evaluate various state-of-the-art neural news image captioning models on MUNIChus and find that news image captioning remains challenging. We also make MUNIChus publicly available with over 20 models already benchmarked. MUNIChus opens new avenues for further advancements in developing and evaluating multilingual news image captioning models.

Paper Structure (16 sections, 4 figures, 2 tables)

Introduction
MUNIChus: Multilingual News Image Captioning Benchmark
Evaluation
Methodology
Prompt-based Generation
Zero-shot
Random Few-shot
Similar Few-shot
Instruction Fine-tuning
Baselines
Results
Conclusion
Acknowledgement
Ethics Statement
Bibliographical References
...and 1 more sections

Figures (4)

Figure 1: Comparison of news and generic image captions for two different images. The generic image captions were generated by BLIP (pmlr-v162-li22n).
Figure 2: Prompt template used for zero-shot image captioning. The {language} placeholder is replaced with the target language name (e.g., "English", "Arabic") at inference time.
Figure 3: Comparison of the actual news image caption and the caption generated by the best model, GPT-4o with the random few-shot approach.
Figure 4: Comparison of similar images retrieved for the similar few-shot approach. The first image is the test instance, and the remaining images are the most similar images retrieved from the training set.
