Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective

Xinmiao Yu, Xiaocheng Feng, Yun Li, Minghui Liao, Ya-Qi Yu, Xiachong Feng, Weihong Zhong, Ruihan Chen, Mengkang Hu, Jihao Wu, Dandan Tu, Duyu Tang, Bing Qin

TL;DR

This work addresses cross-lingual text-rich visual understanding in large vision-language models (LVLMs) by introducing XT-VQA, a benchmark that couples existing text-rich VQA datasets with a new dataset, XPaperQA, to probe instruction following when the image text and the question are in different languages. The authors analyze performance gaps through a mutual information lens, defining $I(Y;V|Q)$ to quantify how well visual information activates the model's outputs under cross-lingual instructions. To mitigate the gap, they propose MVCL-MI, which maximizes cross-lingual vision-language mutual information through distillation: a KL divergence term aligns cross-lingual outputs with monolingual anchors while preserving monolingual performance. Experimental results on XT-VQA show that MVCL-MI narrows the cross-lingual gap on both English and Chinese data, demonstrating the value of cross-lingual MI optimization for robust, multilingual LVLMs. The work provides datasets, analyses, and training details, with code available at the authors' repository for reproducibility and further research.
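
For orientation, the quantity $I(Y;V|Q)$ can be read through the textbook identity for conditional mutual information. This is a sketch under the assumption that $Y$ denotes the model output, $V$ the visual input, and $Q$ the question; the paper's exact estimator may differ.

$$
I(Y;V \mid Q) = H(Y \mid Q) - H(Y \mid V, Q)
$$

Under this reading, a larger value means that seeing the image removes more uncertainty about the answer than the question alone does.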

Abstract

Recent Large Vision-Language Models (LVLMs) have shown promising reasoning capabilities on text-rich images such as charts, tables, and documents. However, the abundant text within such images may increase the model's sensitivity to language. This raises the need to evaluate LVLM performance on cross-lingual text-rich visual inputs, where the language in the image differs from the language of the instructions. To address this, we introduce XT-VQA (Cross-Lingual Text-Rich Visual Question Answering), a benchmark designed to assess how LVLMs handle language inconsistency between image text and questions. XT-VQA integrates five existing text-rich VQA datasets and a newly collected dataset, XPaperQA, covering diverse scenarios that require faithful recognition and comprehension of visual information despite language inconsistency. Our evaluation of prominent LVLMs on XT-VQA reveals a significant drop in performance in cross-lingual scenarios, even for models with multilingual capabilities. A mutual information analysis suggests that this performance gap stems from cross-lingual questions failing to adequately activate relevant visual information. To mitigate this issue, we propose MVCL-MI (Maximization of Vision-Language Cross-Lingual Mutual Information), which builds a visual-text cross-lingual alignment by maximizing the mutual information between the model's outputs and visual information. This is achieved by distilling knowledge from monolingual to cross-lingual settings through KL divergence minimization, where monolingual output logits serve as the teacher. Experimental results on XT-VQA demonstrate that MVCL-MI effectively reduces the visual-text cross-lingual performance disparity while preserving the inherent capabilities of LVLMs, suggesting a practical direction for improving LVLMs. Code is available at: https://github.com/Stardust-y/XTVQA.git
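
The distillation step described in the abstract can be illustrated with a minimal sketch. It assumes per-token logits from a monolingual pass (teacher) and a cross-lingual pass (student) over the same answer tokens. The function names, the toy logits, and the KL direction (teacher to student) are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student).

    teacher_logits: logits from the monolingual pass (assumed fixed).
    student_logits: logits from the cross-lingual pass.
    Both are lists of per-token logit vectors over the same vocabulary.
    """
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)


# Toy example: two answer tokens, vocabulary of size three (hypothetical values).
mono_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
cross_logits = [[1.0, 1.0, -0.5], [0.2, 0.4, 1.1]]
loss = distillation_loss(mono_logits, cross_logits)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero when the cross-lingual distribution matches the monolingual one and grows as they diverge, which is the behaviour that minimizing a KL term is meant to exploit.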

Paper Structure

This paper contains 32 sections, 10 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: An example of an LVLM answering unfaithfully when questions are posed in a language different from that of the image text. The LVLM misrecognizes and miscomprehends the image content for Chinese and French questions while answering correctly for English questions, which reveals the challenge of cross-lingual visual comprehension.
  • Figure 2: The XPaperQA dataset construction pipeline consists of three parts: (1) converting PDF papers into metadata using PaddleOCR and generating three QA types via Gemini; (2) filtering out QA pairs with similarity scores $>0.1$ or confidence scores $<7$ to retain distinct pairs; (3) re-answering the distinct QA pairs with Gemini and discarding inconsistent responses.
  • Figure 3: The entropy distribution of 100 randomly selected examples on the ChartQA dataset in 8 different languages, where the vertical axis represents probability density and the horizontal axis represents the numerical value of entropy. In all 8 languages, the mean and variance of the conditional entropy distribution for correct examples (represented in green) are significantly lower than those for incorrect examples (represented in yellow).
  • Figure 4: Accuracy and mutual information across 8 different languages on the ChartQA dataset. Queries in English (the same language as the image text) perform best, while all other languages show some decrease. The results reflect a correlation between accuracy and mutual information.
  • Figure 5: Data example from XT-VQA
  • ...and 3 more figures