Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization

Iñigo Pikabea, Iñaki Lacunza, Oriol Pareras, Carlos Escolano, Aitor Gonzalez-Agirre, Javier Hernando, Marta Villegas

TL;DR

The paper tackles Image-induced Fidelity Loss (IFL) in Visual Language Models by introducing multilingual textual regularization, injecting text-only multilingual data during visual instruction tuning to preserve multilingual capabilities. It further investigates model merging to fuse multilingual fidelity with visual skills. Empirical results show reduced English bias across languages while maintaining core multimodal performance, though merging introduces trade-offs in some tasks. The approach is scalable and avoids costly multimodal multilingual data collection, offering a practical path toward global, multilingual VLM deployment.

Abstract

Rapid advancements in Visual Language Models (VLMs) have transformed multimodal understanding, but these models often generate English responses regardless of the input language. This phenomenon, termed Image-induced Fidelity Loss (IFL), stems from limited multimodal multilingual training data. To address this, we propose a continuous multilingual integration strategy that injects text-only multilingual data during visual instruction tuning, preserving the language model's original multilingual capabilities. Extensive evaluations demonstrate that our approach significantly improves linguistic fidelity across languages without degrading visual performance. We also explore model merging, which improves language fidelity but comes at the cost of visual performance. In contrast, our core method achieves robust multilingual alignment without trade-offs, offering a scalable and effective path to mitigating IFL for global VLM adoption.
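
As a rough illustration of the idea of injecting text-only multilingual data during visual instruction tuning, the sketch below mixes text-only samples into a multimodal training set at a chosen proportion. It is not the authors' implementation: the function name `mix_with_text_only`, the `text_ratio` parameter, and the sample layout (a dictionary whose `image` field is `None` for text-only samples) are all assumptions made for this example, and the abstract does not state the actual mixing proportion or sampling scheme.

```python
import random
from typing import Dict, List, Optional


def mix_with_text_only(
    multimodal: List[Dict[str, Optional[str]]],
    text_only: List[Dict[str, Optional[str]]],
    text_ratio: float,
    seed: int = 0,
) -> List[Dict[str, Optional[str]]]:
    """Return a shuffled training list in which roughly `text_ratio`
    of the samples are text-only (hypothetical helper, not from the paper)."""
    if not 0.0 <= text_ratio < 1.0:
        raise ValueError("text_ratio must be in [0, 1)")

    rng = random.Random(seed)

    # Number of text-only samples needed so that they make up
    # `text_ratio` of the final mixture: n_text / (n_mm + n_text) = text_ratio.
    n_text = round(len(multimodal) * text_ratio / (1.0 - text_ratio))
    # Never request more text-only samples than are available.
    n_text = min(n_text, len(text_only))

    sampled_text = rng.sample(text_only, n_text)
    mixed = list(multimodal) + sampled_text
    rng.shuffle(mixed)
    return mixed


# Toy data: multimodal samples carry an image path, text-only samples do not.
multimodal_data = [
    {"image": f"img_{i}.jpg", "prompt": "Describe the image.", "lang": "en"}
    for i in range(80)
]
text_only_data = [
    {"image": None, "prompt": "Explica qué es un volcán.", "lang": "es"}
    for _ in range(100)
]

train_set = mix_with_text_only(multimodal_data, text_only_data, text_ratio=0.2)
```

The design choice shown here is that the text-only samples are interleaved with the multimodal ones throughout a single training run, rather than applied in a separate stage before or after it, which is one plausible reading of "continuous" integration in the abstract.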

Paper Structure

This paper contains 46 sections, 1 equation, 18 figures, 13 tables.

Figures (18)

  • Figure 1: Language Fidelity (LF) accuracy on Crossmodal-3600. (BM: Base Model, TR: model trained with multilingual Textual Regularization, TR+M: TR and merging the final model with the original LLM Backbone)
  • Figure 2: Distribution of the multilingual text-only data used for Textual Regularization. Languages with a volume smaller than 3% are grouped under Others, which collectively account for 5.5% of the data. The most frequent languages in this group are Portuguese (2.1%), Italian (0.7%), Polish (0.47%), Swedish (0.42%), Irish (0.39%), Lithuanian (0.29%), Galician (0.22%), Greek (0.20%), and Ukrainian (0.17%).
  • Figure 3: Interval Plot contrasting LF (upper bars) vs. LF+ (lower bars) across languages of our best-performing models.
  • Figure 4: Prompt used to evaluate language consistency via LLM-as-a-judge. The evaluator model assesses the language fidelity of the caption generated by the VLM using multiple criteria. Note that this evaluation focuses solely on language fidelity, not the overall quality of the caption.
  • Figure 5: Prompts used to evaluate, via LLM-as-a-judge, the language consistency of the caption provided by the model.
  • ...and 13 more figures