Evaluation of Multilingual Image Captioning: How far can we get with CLIP models?

Gonçalo Gomes, Chrysoula Zerva, Bruno Martins

TL;DR

Multilingual image captioning evaluation remains underexplored, in part because CLIPScore is largely English-centric. The paper extends CLIPScore to multilingual settings through quality-aware machine translation of English benchmarks and by repurposing multilingual datasets. It also introduces a dual-loss finetuning framework that combines a contrastive loss $L_C$ and a Pearson correlation loss $L_P$, using CrossModal-3600 and VICR. The finetuned multilingual CLIPScore (MLF) correlates more strongly with human judgments across languages and often matches or exceeds English-only baselines on multilingual data, with additional validation on the native multilingual benchmarks XVNLI, MaRVL, and VALSE. These findings support scalable multilingual evaluation pipelines, cost-effective data expansion via MT, and broader applicability to real-world multilingual caption assessment; code and datasets are publicly available.
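
To make the dual-loss idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that $L_C$ is a symmetric InfoNCE-style contrastive loss over a batch of image and caption embeddings, that $L_P$ is one minus the Pearson correlation between predicted scores and human ratings, and that the two terms are combined as a weighted sum. The function names, the `alpha` and `beta` weights, and the `temperature` value are hypothetical. The `clipscore` helper follows the original CLIPScore definition of Hessel et al. (2021), i.e. $w \cdot \max(\cos(c, v), 0)$ with $w = 2.5$.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clipscore(image_emb, text_emb, w=2.5):
    """CLIPScore as in Hessel et al. (2021): w * max(cos, 0)."""
    return w * max(cosine(image_emb, text_emb), 0.0)

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    denom = std_x * std_y
    return cov / denom if denom > 0 else 0.0

def _mean_diagonal_cross_entropy(logits):
    """Mean cross-entropy where row k's correct class is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[k]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Assumed form of L_C: symmetric InfoNCE over matched pairs in a batch."""
    logits = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_mean_diagonal_cross_entropy(logits)
                  + _mean_diagonal_cross_entropy(transposed))

def pearson_loss(scores, human_ratings):
    """Assumed form of L_P: 1 - Pearson r, so higher correlation means lower loss."""
    return 1.0 - pearson(scores, human_ratings)

def dual_loss(image_embs, text_embs, human_ratings, alpha=1.0, beta=1.0):
    """Hypothetical weighted sum of the two terms on a single batch."""
    scores = [clipscore(i, t) for i, t in zip(image_embs, text_embs)]
    return (alpha * contrastive_loss(image_embs, text_embs)
            + beta * pearson_loss(scores, human_ratings))
```

In this sketch both terms are computed on the same batch for brevity; in practice the contrastive term needs only image-caption pairs, while the Pearson term needs pairs that carry human ratings.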

Abstract

The evaluation of image captions, looking at both linguistic fluency and semantic correspondence to visual contents, has witnessed a significant effort. Still, despite advancements such as the CLIPScore metric, multilingual captioning evaluation has remained relatively unexplored. This work presents several strategies, and extensive experiments, related to evaluating CLIPScore variants in multilingual settings. To address the lack of multilingual test data, we consider two different strategies: (1) using quality-aware machine-translated datasets with human judgements, and (2) re-purposing multilingual datasets that target semantic inference and reasoning. Our results highlight the potential of finetuned multilingual models to generalize across languages and to handle complex linguistic challenges. Tests with machine-translated data show that multilingual CLIPScore models can maintain a high correlation with human judgements across different languages, and additional tests with natively multilingual and multicultural data further attest to the high-quality assessments.

Paper Structure

This paper contains 13 sections, 2 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Pearson correlation scores between different languages, for the original multilingual CLIPScore model (square cells) and our finetuned version (circular cells). The first heatmap considers the complete set of instances from the VICR dataset, reporting results for the original and finetuned model versions in the lower and upper diagonal values, respectively. The second and third heatmaps consider the subsets of instances with COMETKiwi scores below the 25th percentile (lower diagonal values) and above the 75th percentile (upper diagonal values) for each language, for the original multilingual CLIPScore model and our finetuned model version, respectively.
  • Figure 2: The three different XVNLI multilingual classification tasks, where accuracy is defined based on comparisons between CLIPScore values.
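
The two computations mentioned in these captions can be sketched as follows. This is an illustrative reading rather than the paper's evaluation code: `quartile_subsets` selects instances whose translation-quality score (for example, a COMETKiwi score) falls below the 25th or above the 75th percentile, and `pairwise_accuracy` counts how often the caption expected to score higher actually receives the higher CLIPScore. Both function names are hypothetical, and the exact pairing used in each of the three XVNLI tasks is not specified here.

```python
import statistics

def quartile_subsets(quality_scores):
    """Return (low, high) index lists: below the 25th / above the 75th percentile."""
    q1, _, q3 = statistics.quantiles(quality_scores, n=4, method="inclusive")
    low = [i for i, q in enumerate(quality_scores) if q < q1]
    high = [i for i, q in enumerate(quality_scores) if q > q3]
    return low, high

def pairwise_accuracy(preferred_scores, other_scores):
    """Fraction of pairs in which the preferred caption scores strictly higher."""
    pairs = list(zip(preferred_scores, other_scores))
    return sum(p > o for p, o in pairs) / len(pairs)

# Toy usage with made-up numbers (not results from the paper).
low_idx, high_idx = quartile_subsets([1, 2, 3, 4, 5, 6, 7, 8, 9])
accuracy = pairwise_accuracy([0.9, 0.4, 0.7], [0.5, 0.6, 0.1])
```

Correlations for the low-quality and high-quality translation subsets can then be computed by restricting the score lists to `low_idx` or `high_idx` before applying a Pearson correlation.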