Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs

Francisco Nogueira, Alexandre Bernardino, Bruno Martins

TL;DR

This work tackles multilingual Referring Expression Comprehension (REC) by addressing English-centric biases through a unified multilingual dataset (≈8 million expressions across 10 languages from 12 English benchmarks) and an attention-anchored grounding model built on frozen SigLIP2 encoders. The dataset construction combines translation (to 9 languages) with multilingual quality enhancement using visual context, yielding broad cross-lingual coverage over 177,620 images and 336,882 objects. The model decomposes localization into coarse spatial anchors derived from attention, followed by residual refinement, and is trained with a three-term loss that jointly optimizes coordinate accuracy, geometric overlap, and attention alignment. Across aggregate and multilingual evaluations, the approach yields competitive results with modest multilingual drops (e.g., Romance languages near English performance), demonstrating practical feasibility for multilingual visual grounding without language-specific architectural changes.
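The anchor-then-refine decomposition described above can be illustrated with a small sketch. This is not the authors' implementation: the function names `attention_anchor` and `refine`, the (cx, cy, w, h) box format in normalized coordinates, and the rule that sets the box extent to four standard deviations of the attention mass are all assumptions made for illustration. In the actual model the residuals would be predicted by a learned head, whereas here they are passed in by hand.

```python
import math


def attention_anchor(attn):
    """Coarse (cx, cy, w, h) box in [0, 1] coordinates from a 2D attention map.

    `attn` is a list of rows of non-negative attention weights over image patches.
    """
    rows, cols = len(attn), len(attn[0])
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must contain positive mass")
    # Normalize so the map is a probability distribution over patches.
    probs = [[a / total for a in row] for row in attn]
    # Patch centres in normalized image coordinates.
    xs = [(j + 0.5) / cols for j in range(cols)]
    ys = [(i + 0.5) / rows for i in range(rows)]
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    # Anchor centre: attention-weighted mean position.
    cx = sum(probs[i][j] * xs[j] for i, j in cells)
    cy = sum(probs[i][j] * ys[i] for i, j in cells)
    var_x = sum(probs[i][j] * (xs[j] - cx) ** 2 for i, j in cells)
    var_y = sum(probs[i][j] * (ys[i] - cy) ** 2 for i, j in cells)
    # Assumed heuristic: extent of two standard deviations on each side,
    # never smaller than one patch and never larger than the image.
    w = min(max(4 * math.sqrt(var_x), 1 / cols), 1.0)
    h = min(max(4 * math.sqrt(var_y), 1 / rows), 1.0)
    return cx, cy, w, h


def refine(anchor, residual):
    """Add residual offsets to the anchor and clamp each value to [0, 1]."""
    return tuple(min(max(a + r, 0.0), 1.0) for a, r in zip(anchor, residual))


# All attention on the patch in row 1, column 2 of a 4x4 grid.
attn = [[0.0] * 4 for _ in range(4)]
attn[1][2] = 1.0
anchor = attention_anchor(attn)
box = refine(anchor, (0.05, -0.05, 0.10, 0.0))
```

The point of the split is that the attention map only has to indicate roughly where the target is, while the residual term absorbs the fine correction to position and size.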

Abstract

Referring Expression Comprehension (REC) requires models to localize objects in images based on natural language descriptions. Research in this area remains predominantly English-centric, despite increasing global deployment demands. This work addresses multilingual REC through two main contributions. First, we construct a unified multilingual dataset spanning 10 languages by systematically expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. The resulting dataset comprises approximately 8 million multilingual referring expressions across 177,620 images, with 336,882 annotated objects. Second, we introduce an attention-anchored neural architecture that uses multilingual SigLIP2 encoders. Our attention-based approach generates coarse spatial anchors from attention distributions, which are subsequently refined through learned residuals. Experimental evaluation demonstrates competitive performance on standard benchmarks, e.g., achieving 86.9% accuracy at IoU@50 on RefCOCO aggregate multilingual evaluation, compared to an English-only result of 91.3%. Multilingual evaluation shows consistent capabilities across languages, establishing the practical feasibility of multilingual visual grounding systems. The dataset and model are available at https://multilingual.franreno.com.
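For readers unfamiliar with the IoU@50 metric quoted above, the following minimal sketch shows how such an accuracy is commonly computed. It assumes boxes in (x1, y1, x2, y2) corner format and counts a prediction as correct when its intersection over union with the ground truth is at least 0.5. Whether the paper uses a strict or non-strict comparison is not stated here, and the function names `iou` and `accuracy_at_iou` are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Overlap is zero when the boxes do not intersect.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the target reaches the threshold."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and equal in length")
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


# One exact match and one complete miss.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 10), (20, 20, 30, 30)]
score = accuracy_at_iou(preds, targets)
```

Under this definition, a reported 86.9% means that 86.9% of the evaluated expressions yielded a predicted box overlapping the ground truth by at least half of their combined area.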

Paper Structure

This paper contains 22 sections, 7 equations, 18 figures, 3 tables.

Figures (18)

  • Figure 1: The proposed attention-anchored architecture, featuring relatively small frozen SigLIP2 encoders, bidirectional cross-modal fusion, text query aggregation, a cross-attention decoder, and attention-guided bounding box prediction with residual refinement.
  • Figure 2: Multilingual performance in terms of IoU@50 across datasets and languages. Rows represent datasets, while columns represent languages. Green indicates higher IoU@50.
  • Figure 3: Translation quality distributions before and after visual enhancement for the nine target languages. Each panel shows overlaid histograms of COMETkiwi-DA scores for original (coral) and enhanced (teal) translations. All languages show rightward distribution shifts, with Chinese exhibiting the most substantial improvement.
  • Figure 4: Multilingual localization consistency demonstrated on a wine tasting scene. The model accurately identifies the second wine glass from the left when queried in Russian (left), English (center), and Italian (right). Despite linguistic variations across three language families (Slavic, Germanic, and Romance), the spatial predictions remain nearly identical, with green boxes indicating the ground truth and red boxes showing the model predictions.
  • Figure 5: Examples illustrating fine-grained object discrimination in a multi-object scene. Three referring expressions targeting different objects in the same office environment demonstrate compositional reasoning: identifying an earphone by material and color (left), a cup by color and text pattern (center), and a glass container by material properties (right). The model successfully disambiguates targets from distractors using combined attribute descriptions.
  • ...and 13 more figures