MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching

Fabian David Schmidt, Florian Schneider, Chris Biemann, Goran Glavaš

TL;DR

MVL-SIB introduces a massively multilingual vision-language benchmark spanning 205 languages to evaluate cross-modal and text-only topical matching. It defines i2s, s2i, t2s, and s2t tasks with varying numbers of reference images or sentences, enabling analysis of single- vs multi-image VL reasoning and separate language vs VL support. Across open-weight LVLMs and GPT-4o-mini, the study reveals strong performance in high-resource languages but dramatic drops for low-resource languages, with VL capabilities lagging behind textual support and limited benefits from multiple images for open-weight models. The benchmark correlates well with existing multilingual VL benchmarks, underscoring MVL-SIB as a robust probe of multilingual vision-language understanding and a call to broaden VL training to low-resource languages.

Abstract

Existing multilingual vision-language (VL) benchmarks often cover only a handful of languages. Consequently, evaluations of large vision-language models (LVLMs) predominantly target high-resource languages, underscoring the need for evaluation data for low-resource languages. To address this limitation, we introduce MVL-SIB, a massively multilingual vision-language benchmark that evaluates both cross-modal and text-only topical matching across 205 languages -- over 100 more than the most multilingual existing VL benchmarks encompass. We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini) on MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic matching in lower-resource languages, performing no better than chance on languages like N'Koo. Our analysis further reveals that VL support in LVLMs declines disproportionately relative to textual support for lower-resource languages, as evidenced by comparison of cross-modal and text-only topical matching performance. We further observe that open-weight LVLMs do not benefit from representing a topic with more than one image, suggesting that these models are not yet fully effective at handling multi-image tasks. By correlating performance on MVL-SIB with other multilingual VL benchmarks, we highlight that MVL-SIB serves as a comprehensive probe of multilingual VL understanding in LVLMs.

Paper Structure

This paper contains 27 sections, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Cross-modal topic matching 'Images-To-Sentence' for German with $k{=}5$ reference images.
  • Figure 2: Images-To-Sentences @ $k{=}3$. The English prompt describes the cross-modal topic matching task, lists all topics, and provides both $k{=}3$ reference images and 4 sentences in the corresponding language {eng_Latn, $\dots$, nqo_Nkoo}. LVLMs must select the sentence of 4 options that topically fits $k{=}3$ reference images. The sentences spanning 205 languages and 7 topics are drawn from SIB-200 \cite{adelani-etal-2024-sib}, while images for the topics were hand-selected (cf. Appendix \ref{app:images-per-topic}). An example prompt is shown in Appendix \ref{app:images-to-sentences}; further details are in §\ref{sec:experimental-setup}. Plot. The x-axis orders the languages of the candidate sentences {eng_Latn, $\dots$, nqo_Nkoo}, respectively, by descending performance (y-axis). The top x-axis indicates the running index of each language $L_i$ ($i \in \{1, \dots, 205\}$).
  • Figure 3: Correlations Between MVL-SIB & Multilingual VL Benchmarks. Pearson correlation coefficients obtained by regressing MVL-SIB performance against performance on each multilingual VL task, over the languages common to both datasets. An asterisk (*) indicates that the coefficient is statistically significant at $p \leq 0.05$.
  • Figure 5: Larger LVLMs on Subsampled Tiers. We extract 3 languages per tier that mimic the average performance of the full language groups (cf. §\ref{subsec:further-analyses}) and evaluate LVLMs across all model sizes on {i2s, s2i, t2s, s2t} @ $k{=}3$ (cf. §\ref{sec:tasks}).
  • Figure 6: Images-To-Sentences @ $k{=}3$. The English prompt describes the cross-modal topic matching task, lists all topics, and provides both $k{=}3$ reference images and 4 sentences in the corresponding language {eng_Latn, $\dots$, nqo_Nkoo}. LVLMs must select the sentence of 4 options that topically fits $k{=}3$ reference images. The sentences spanning 205 languages and 7 topics are drawn from SIB-200 \cite{adelani-etal-2024-sib}, while images for the topics were hand-selected (cf. Appendix \ref{app:images-per-topic}). An example prompt is shown in Appendix \ref{app:images-to-sentences}; further details are in §\ref{sec:experimental-setup}. Plot. The x-axis orders the languages of the candidate sentences {eng_Latn, $\dots$, nqo_Nkoo}, respectively, by descending performance (y-axis). The top x-axis indicates the running index of each language $L_i$ ($i \in \{1, \dots, 205\}$). Tiers. The languages are grouped by tiers derived from \cite{joshi-etal-2020-state} (cf. §\ref{sec:results}).
  • ...and 4 more figures
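
The Figure 3 caption describes correlating per-language MVL-SIB performance with performance on another multilingual VL benchmark, restricted to the languages both datasets cover. The following is a minimal sketch of that computation under stated assumptions: the function name `pearson_on_common_languages`, the dictionary layout, and all accuracy values are hypothetical and not taken from the paper. The sketch computes only the coefficient; a significance test at $p \leq 0.05$, as marked by the asterisk in the figure, would need an additional statistical routine.

```python
from math import sqrt


def pearson_on_common_languages(scores_a, scores_b):
    """Return (Pearson r, sorted common languages) for two score dicts.

    Both arguments map a language code to a per-language score.
    Only languages present in both dicts enter the computation.
    """
    common = sorted(set(scores_a) & set(scores_b))
    if len(common) < 2:
        raise ValueError("need at least two common languages")

    xs = [scores_a[lang] for lang in common]
    ys = [scores_b[lang] for lang in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Unnormalized covariance and variances; the 1/n factors cancel in r.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")

    return cov / sqrt(var_x * var_y), common


# Made-up accuracies for illustration only (not results from the paper).
mvl_sib = {"eng_Latn": 0.90, "deu_Latn": 0.80, "swh_Latn": 0.50, "nqo_Nkoo": 0.25}
other_benchmark = {"eng_Latn": 0.85, "deu_Latn": 0.70, "swh_Latn": 0.40, "fra_Latn": 0.75}

r, langs = pearson_on_common_languages(mvl_sib, other_benchmark)
print(f"languages used: {langs}")
print(f"Pearson r: {r:.3f}")
```

In this toy input, `nqo_Nkoo` and `fra_Latn` are dropped because each appears in only one of the two dictionaries, which mirrors the "languages common to both datasets" restriction in the caption.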