Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective
Xinmiao Yu, Xiaocheng Feng, Yun Li, Minghui Liao, Ya-Qi Yu, Xiachong Feng, Weihong Zhong, Ruihan Chen, Mengkang Hu, Jihao Wu, Dandan Tu, Duyu Tang, Bing Qin
TL;DR
This work studies cross-lingual text-rich visual understanding in large vision-language models (LVLMs). It introduces XT-VQA, a benchmark that combines existing text-rich VQA datasets with a new dataset, XPaperQA, to test instruction following when the text in the image and the question are in different languages. The authors analyze the resulting performance gap through a mutual information lens, using $I(Y;V|Q)$ to quantify how strongly visual information drives the model’s outputs under cross-lingual instructions. To narrow the gap, they propose MVCL-MI, which maximizes cross-lingual vision-language mutual information through distillation: a KL divergence term aligns the model’s cross-lingual outputs with its monolingual outputs, which act as the teacher, while preserving monolingual performance. Experiments on XT-VQA show that MVCL-MI narrows the cross-lingual gap on both English and Chinese data. The paper provides the datasets, analyses, and training details, and code is available in the authors’ repository.
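For reference, the textbook definition of conditional mutual information is given below, reading $Y$ as the model output, $V$ as the visual input, and $Q$ as the question. This reading is inferred from the summary above, and the estimator actually used in the paper may differ.

$$I(Y;V\mid Q) = H(Y\mid Q) - H(Y\mid V,Q) = \mathbb{E}_{p(y,v,q)}\left[\log\frac{p(y\mid v,q)}{p(y\mid q)}\right]$$

A value near zero means that, once the question is known, the image adds little information about the answer.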
Abstract
Recent Large Vision-Language Models (LVLMs) have shown promising reasoning capabilities on text-rich images such as charts, tables, and documents. However, the abundant text within such images may increase the model's sensitivity to language. This raises the need to evaluate LVLM performance on cross-lingual text-rich visual inputs, where the language in the image differs from the language of the instructions. To address this, we introduce XT-VQA (Cross-Lingual Text-Rich Visual Question Answering), a benchmark designed to assess how LVLMs handle language inconsistency between image text and questions. XT-VQA integrates five existing text-rich VQA datasets and a newly collected dataset, XPaperQA, covering diverse scenarios that require faithful recognition and comprehension of visual information despite language inconsistency. Our evaluation of prominent LVLMs on XT-VQA reveals a significant drop in performance in cross-lingual scenarios, even for models with multilingual capabilities. A mutual information analysis suggests that this performance gap stems from cross-lingual questions failing to adequately activate relevant visual information. To mitigate this issue, we propose MVCL-MI (Maximization of Vision-Language Cross-Lingual Mutual Information), which builds a visual-text cross-lingual alignment by maximizing mutual information between the model's outputs and visual information. This is achieved by distilling knowledge from the monolingual setting to the cross-lingual setting through KL divergence minimization, with the monolingual output logits serving as the teacher. Experimental results on XT-VQA demonstrate that MVCL-MI effectively reduces the visual-text cross-lingual performance disparity while preserving the inherent capabilities of LVLMs, suggesting a practical direction for improving LVLMs. Code is available at: https://github.com/Stardust-y/XTVQA.git
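The distillation step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `temperature` parameter, and the direction KL(teacher || student) are assumptions chosen to match common distillation practice. Here the teacher logits stand for the model's outputs on a monolingual question and the student logits for its outputs on the same image with a cross-lingual question.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean KL(teacher || student) over positions.

    Hypothetical helper: teacher_logits are monolingual-setting logits,
    student_logits are cross-lingual-setting logits, both shaped
    (num_positions, vocab_size). The temperature is an assumed knob.
    """
    t_logp = log_softmax(np.asarray(teacher_logits, dtype=np.float64) / temperature)
    s_logp = log_softmax(np.asarray(student_logits, dtype=np.float64) / temperature)
    t_p = np.exp(t_logp)
    # KL divergence per position, summed over the vocabulary axis.
    kl_per_position = (t_p * (t_logp - s_logp)).sum(axis=-1)
    return float(kl_per_position.mean())


# Toy logits: 2 output positions, vocabulary of 3 tokens.
mono_logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
cross_logits = np.array([[1.0, 1.0, 0.0], [0.3, 0.2, 0.1]])
loss = kl_distillation_loss(mono_logits, cross_logits)
```

The loss is zero when the cross-lingual distribution matches the monolingual one and grows as they diverge. In an actual training loop the teacher logits would typically be treated as constants, so that gradients flow only through the cross-lingual branch.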
