Same Content, Different Representations: A Controlled Study for Table QA
Yue Zhang, Seiji Maekawa, Nikita Bhutani
TL;DR
The paper investigates how table representation (structured versus semi-structured) affects question answering over tables. It introduces RePairTQA, a controlled diagnostic benchmark that pairs the two representations of the same content and splits along four factors: table size, join requirements, query complexity, and schema quality. Benchmarking NL2SQL, LLM-based, and hybrid methods shows that representation largely governs performance: NL2SQL excels on structured data but degrades on semi-structured tables; LLMs are robust yet less precise; and hybrids strike a balance, especially under noisy schemas. Table size, joins, and query complexity further modulate these trends, and schema quality significantly affects SQL-based methods. The findings underscore the need for representation-aware system design and pave the way for robust hybrid approaches across diverse real-world data formats.
Abstract
Table Question Answering (Table QA) in real-world settings must operate over both structured databases and semi-structured tables containing textual fields. However, existing benchmarks are tied to fixed data formats and have not systematically examined how representation itself affects model performance. We present the first controlled study that isolates the role of table representation by holding content constant while varying structure. Using a verbalization pipeline, we generate paired structured and semi-structured tables, enabling direct comparisons across modeling paradigms. To support detailed analysis, we introduce RePairTQA, a diagnostic benchmark with splits along table size, join requirements, query complexity, and schema quality. Our experiments reveal consistent trade-offs: SQL-based methods achieve high accuracy on structured inputs but degrade on semi-structured data, LLMs exhibit flexibility but reduced precision, and hybrid approaches strike a balance, particularly under noisy schemas. These effects intensify with larger tables and more complex queries. Ultimately, no single method excels across all conditions, and we highlight the central role of representation in shaping Table QA performance. Our findings provide actionable insights for model selection and design, paving the way for more robust hybrid approaches suited for diverse real-world data formats.
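To make the idea of pairing representations concrete, the following is a minimal sketch of what "holding content constant while varying structure" could look like. It is not the paper's verbalization pipeline, whose details are not given here. The function name `verbalize_columns`, the `description` field, the sentence template, and the sample rows are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

Row = Dict[str, object]


def verbalize_columns(
    rows: List[Row],
    columns: List[str],
    text_field: str = "description",
) -> List[Row]:
    """Fold the chosen columns of each row into a single free-text field.

    Hypothetical helper: it keeps the same information as the structured
    row, but moves part of it out of typed columns and into prose.
    """
    semi_structured = []
    for row in rows:
        # Columns that stay structured in the semi-structured variant.
        kept = {k: v for k, v in row.items() if k not in columns}
        # Assumed template; a real pipeline would likely produce more natural text.
        sentence = "; ".join(f"{col} is {row[col]}" for col in columns if col in row)
        kept[text_field] = sentence + "."
        semi_structured.append(kept)
    return semi_structured


# Illustrative structured table (made-up rows, not benchmark data).
structured = [
    {"name": "Alice", "team": "Search", "salary": 120000},
    {"name": "Bob", "team": "Ads", "salary": 95000},
]

# The paired semi-structured table carries the same content in a text field.
paired = verbalize_columns(structured, ["team", "salary"])
```

In this sketch a question such as "What is Bob's salary?" has the same answer over both tables, but a SQL query can read it from a `salary` column only in the structured variant; in the paired variant the value must be extracted from text.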
