MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching
Fabian David Schmidt, Florian Schneider, Chris Biemann, Goran Glavaš
TL;DR
MVL-SIB is a massively multilingual vision-language benchmark spanning 205 languages that evaluates cross-modal and text-only topical matching. It defines i2s, s2i, t2s, and s2t tasks with varying numbers of reference images or sentences, which allows single-image and multi-image VL reasoning to be compared and textual language support to be separated from VL support. Across open-weight LVLMs and GPT-4o-mini, the study reveals strong performance in high-resource languages but dramatic drops for low-resource languages, with VL capabilities lagging behind textual support and limited benefits from multiple images for open-weight models. Performance on the benchmark correlates well with existing multilingual VL benchmarks, underscoring MVL-SIB as a robust probe of multilingual vision-language understanding and a call to broaden VL training to low-resource languages.
Abstract
Existing multilingual vision-language (VL) benchmarks often cover only a handful of languages. Consequently, evaluations of large vision-language models (LVLMs) predominantly target high-resource languages, underscoring the need for evaluation data for low-resource languages. To address this limitation, we introduce MVL-SIB, a massively multilingual vision-language benchmark that evaluates both cross-modal and text-only topical matching across 205 languages, over 100 more than the most multilingual existing VL benchmarks encompass. We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini) on MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic matching in lower-resource languages, performing no better than chance on languages like N'Koo. Our analysis further reveals that VL support in LVLMs declines disproportionately relative to textual support for lower-resource languages, as evidenced by a comparison of cross-modal and text-only topical matching performance. We further observe that open-weight LVLMs do not benefit from representing a topic with more than one image, suggesting that these models are not yet fully effective at handling multi-image tasks. By correlating performance on MVL-SIB with other multilingual VL benchmarks, we highlight that MVL-SIB serves as a comprehensive probe of multilingual VL understanding in LVLMs.
