Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables

Anshul Singh, Rohan Chaudhary, Gagneet Singh, Abhay Kumary

TL;DR

Existing tabular QA benchmarks either ignore visual appearance or are English-centric. MirageTVQA addresses this gap with a large-scale, visually noisy, multilingual benchmark for VLMs. The authors collect 3,000 English tables, translate them into 24 languages via a multi-stage pipeline, render them with 40+ styles and added noise, and generate 80,520 QA pairs across 244 tables and 30 languages. They then evaluate open-source VLMs under clean and noisy conditions. Two findings stand out. First, performance drops sharply when visual noise is present: for top models, English EM falls from $25.52\%$ to $16.50\%$, a relative decrease of about $35.3\%$. Second, models show a persistent English-first bias with limited cross-lingual transfer. The benchmark supports assessment of real-world table reasoning and points toward more robust, multilingual VLMs.
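The relative decrease quoted in the TL;DR can be checked directly from the two EM scores. The snippet below is a minimal sketch of that arithmetic only; the function name `relative_drop` is illustrative and is not taken from the paper or its code release.

```python
def relative_drop(clean: float, noisy: float) -> float:
    """Return the decrease from `clean` to `noisy` as a percentage of `clean`."""
    if clean <= 0:
        raise ValueError("clean score must be positive")
    return (clean - noisy) / clean * 100.0


# EM scores (in percent) for English, as quoted in the TL;DR.
clean_em = 25.52
noisy_em = 16.50

drop = relative_drop(clean_em, noisy_em)
print(f"absolute drop: {clean_em - noisy_em:.2f} points")  # absolute drop: 9.02 points
print(f"relative drop: {drop:.1f}%")                       # relative drop: 35.3%
```

The absolute gap is 9.02 percentage points, while the relative figure divides that gap by the clean score, which is why it reads as roughly 35%.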

Abstract

The impressive performance of VLMs is largely measured on benchmarks that fail to capture the complexities of real-world scenarios. Existing datasets for tabular QA, such as WikiTableQuestions and FinQA, are overwhelmingly monolingual (English) and present tables in a digitally perfect, clean format. This creates a significant gap between research and practice. To address this, we present MirageTVQA, a new benchmark designed to evaluate VLMs on these exact dimensions. Featuring nearly 60,000 QA pairs across 24 languages, MirageTVQA challenges models with tables that are not only multilingual but also visually imperfect, incorporating realistic noise to mimic scanned documents. Our evaluation of the leading VLMs reveals two primary failure points: a severe degradation in performance (over 35% drop for the best models) when faced with visual noise, and a consistent English-first bias where reasoning abilities fail to transfer to other languages. MirageTVQA provides a benchmark for measuring and driving progress towards more robust VLMs for table reasoning. The dataset and the code are available at: https://github.com/anshulsc/MirageTVQA.

Paper Structure

This paper contains 15 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Per-language F1 score performance on MirageTVQA across different model scales.
  • Figure 2: LLM prompt for multilingual QA pair translation. Placeholders like {target_language} and {context_table_json} represent actual input data provided to the model.
  • Figure 3: LLM prompt for automated QA pair generation. Placeholders like {table_as_json_string} represent the actual table data provided to the model.