MUNIChus: Multilingual News Image Captioning Benchmark

Yuji Chen, Alistair Plum, Hansi Hettiarachchi, Diptesh Kanojia, Saroj Basnet, Marcos Zampieri, Tharindu Ranasinghe

TL;DR

MUNIChus is the first multilingual news image captioning benchmark. It comprises 9 languages, including several low-resource languages such as Sinhala and Urdu, and is used to evaluate various state-of-the-art neural news image captioning models.

Abstract

The goal of news image captioning is to generate captions by integrating news article content with corresponding images, highlighting the relationship between textual context and visual elements. The majority of research on news image captioning focuses on English, primarily because datasets in other languages are scarce. To address this limitation, we create the first multilingual news image captioning benchmark, MUNIChus, comprising 9 languages, including several low-resource languages such as Sinhala and Urdu. We evaluate various state-of-the-art neural news image captioning models on MUNIChus and find that news image captioning remains challenging. We also make MUNIChus publicly available with over 20 models already benchmarked. MUNIChus opens new avenues for further advancements in developing and evaluating multilingual news image captioning models.

Paper Structure (16 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Comparison of news and generic image captions for two different images. The generic image captions were generated by BLIP (Li et al., 2022).
  • Figure 2: Prompt template used for zero-shot image captioning. The {language} placeholder is replaced with the target language name (e.g., "English", "Arabic") at inference time.
  • Figure 3: Comparison of the actual news image caption and the caption generated by the best-performing model, GPT-4o with the random few-shot approach.
  • Figure 4: Images retrieved for the similar-shot approach. The first image is the test instance; the remaining images are the most similar ones retrieved from the training set.