Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations

Gregor Geigle; Radu Timofte; Goran Glavaš

TL;DR

Babel-ImageNet is a massively multilingual benchmark that links ImageNet classes to BabelNet through shared WordNet synsets, yielding (partial) class-label translations for 100 languages without machine translation or manual annotation. Across 11 public multilingual CLIP models, the study finds substantial performance gaps relative to English, which are largest for low-resource languages. It also shows that zero-shot image classification performance correlates strongly with multilingual image-text retrieval performance, which supports using Babel-ImageNet as a proxy for multilingual VL representation quality in languages that lack gold image-text data. Finally, it introduces a parameter-efficient, adapter-based language specialization approach that significantly boosts low-resource language performance, while highlighting remaining challenges in cross-language generalization and the need for improved multilingual distillation. Code and data are publicly released.

Abstract

Vision-and-language (VL) models with separate encoders for each modality (e.g., CLIP) have become the go-to models for zero-shot image classification and image-text retrieval. They are, however, mostly evaluated in English as multilingual benchmarks are limited in availability. We introduce Babel-ImageNet, a massively multilingual benchmark that offers (partial) translations of ImageNet labels to 100 languages, built without machine translation or manual annotation. We instead automatically obtain reliable translations by linking them, via shared WordNet synsets, to BabelNet, a massively multilingual lexico-semantic network. We evaluate 11 public multilingual CLIP models on zero-shot image classification (ZS-IC) on our benchmark, demonstrating a significant gap between English ImageNet performance and that of high-resource languages (e.g., German or Chinese), and an even bigger gap for low-resource languages (e.g., Sinhala or Lao). Crucially, we show that the models' ZS-IC performance highly correlates with their performance in image-text retrieval, validating the use of Babel-ImageNet to evaluate multilingual models for the vast majority of languages without gold image-text data. Finally, we show that the performance of multilingual CLIP can be drastically improved for low-resource languages with parameter-efficient language-specific training. We make our code and data publicly available: https://github.com/gregor-ge/Babel-ImageNet
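
To make the ZS-IC setting concrete, the following is a minimal sketch of how a dual-encoder model classifies an image: each image embedding is compared with the text embeddings of the class labels, and the most similar label wins. It assumes the embeddings have already been computed; the random arrays below are toy stand-ins, not outputs of any of the evaluated CLIP models, and the function names are illustrative rather than taken from the released code.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(image_emb, label_emb):
    """Return, for each image, the index of the most similar class label."""
    sims = l2_normalize(image_emb) @ l2_normalize(label_emb).T
    return sims.argmax(axis=1)


def top1_accuracy(pred, gold):
    """Fraction of images whose predicted class matches the gold class."""
    return float(np.mean(pred == gold))


# Toy stand-ins (assumption): in practice these would come from the image
# encoder and from the text encoder applied to labels in the target language.
rng = np.random.default_rng(0)
num_classes, dim = 5, 16
label_emb = rng.normal(size=(num_classes, dim))
gold = np.array([0, 1, 2, 3, 4, 0, 1])
# Each toy image embedding is its class's label embedding plus small noise.
image_emb = label_emb[gold] + 0.01 * rng.normal(size=(len(gold), dim))

pred = zero_shot_classify(image_emb, label_emb)
accuracy = top1_accuracy(pred, gold)
```

Because only the label embeddings change from language to language, the same images can be reused for every language that has labels for the corresponding classes.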

Paper Structure (33 sections, 7 figures, 12 tables)

Introduction
Related Work
Multilingual Vision-and-Language Benchmarks.
Multilingual CLIP.
Babel-ImageNet
Why (massively) multilingual ZS-IC?
WordNet as a matchmaker for ImageNet and BabelNet.
Class Translation and Cleaning Process.
Language Selection.
Grouping Languages in Evaluation.
Verification.
Benchmarking CLIP Models
Zero-Shot Image Classification Setup.
ZS-IC Results.
Validating Babel-ImageNet
...and 18 more sections

Figures (7)

Figure 1: Illustrating the creation of Babel-ImageNet: ImageNet classes correspond to WordNet IDs, which are integrated into BabelNet, a multilingual semantic net. Through this, we look up synonymous word senses in all available languages, perform some cleaning and filtering and select one sense as label.
Figure 2: English ImageNet results with random subsets of the 1k classes (5 random seeds each).
Figure 3: R@1 text-to-image retrieval results on three datasets plotted against Babel-ImageNet performance (each dot denotes the performance of one model for one language) together with a linear regression estimate (95% CI).
Figure 4: Number of classes in Babel-ImageNet plotted against the number of tokens (millions, log10) in the XLM-R pretraining corpus. When taking the XLM-R tokens as proxy for "resourceness" of a language, we see that this generally correlates with the number of classes. Vertical lines indicate the grouping of languages for evaluation.
Figure 5: Average increase over label-only inputs within the low/mid/high language groups (with 95% CI) when using English prompts (with non-English labels) and our machine-translated prompts.
...and 2 more figures
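
The label creation process described in the Figure 1 caption can be sketched roughly as follows. This is a toy illustration only: `TOY_SENSES` is a hypothetical in-memory stand-in for a BabelNet lookup keyed by WordNet ID, and the single filter shown (dropping senses identical to an English sense) is an assumed example of a cleaning step, not the paper's actual cleaning rules.

```python
# Hypothetical stand-in for BabelNet: WordNet ID -> language -> word senses.
TOY_SENSES = {
    "n01440764": {
        "en": ["tench", "Tinca tinca"],
        "de": ["Tinca tinca", "Schleie"],
        "es": ["tenca"],
    },
    "n01443537": {
        "en": ["goldfish", "Carassius auratus"],
        "de": ["Goldfisch"],
        # No Spanish senses in this toy data, so the class is skipped for "es".
    },
}


def select_label(wordnet_id, lang, senses=TOY_SENSES):
    """Pick one sense as the class label, or None if nothing usable remains."""
    entry = senses.get(wordnet_id, {})
    english = {s.casefold() for s in entry.get("en", [])}
    # Assumed cleaning step: discard senses that merely repeat an English sense.
    candidates = [s for s in entry.get(lang, []) if s.casefold() not in english]
    return candidates[0] if candidates else None


def build_labels(wordnet_ids, lang, senses=TOY_SENSES):
    """Map each class that has a usable sense in `lang` to its selected label."""
    labels = {}
    for wordnet_id in wordnet_ids:
        label = select_label(wordnet_id, lang, senses)
        if label is not None:
            labels[wordnet_id] = label
    return labels


classes = ["n01440764", "n01443537"]
german_labels = build_labels(classes, "de")
spanish_labels = build_labels(classes, "es")
```

In this toy data German covers both classes while Spanish covers only one, which mirrors why the benchmark offers partial rather than complete translations for many languages.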
