MULTITAT: Benchmarking Multilingual Table-and-Text Question Answering
Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che
TL;DR
This work introduces MULTITAT, the first multilingual benchmark for question answering over hybrid tables and text (TATQA). It addresses the English-centric nature of prior datasets by sampling instances from HybridQA, TAT-QA, and SciTAT and translating them into 10 languages. To transfer English TATQA capabilities to non-English contexts, the authors propose Ours, a two-module baseline consisting of Linking (cross-language information retrieval) and Reasoning (English-language program generation). Empirical results show a substantial performance gap on non-English data (an average drop of about 19.4% relative to English). Ours improves on the other baselines by about 3.3 EM/F1 on average, though all models still find the task challenging (EM/F1 below 40). The paper also analyzes prompt language, cross-lingual settings, answer sources and types, and error modes, and highlights the role of linking and cross-lingual reasoning in hybrid table-and-text QA.
Abstract
Question answering over the hybrid context of tables and text (TATQA) is a critical task with broad applications in data-intensive domains. However, existing TATQA datasets are limited to English, which leads to two drawbacks: (i) they overlook the challenges of multilingual TATQA and cannot assess model performance in multilingual settings; (ii) they do not reflect real-world scenarios, where tables and text frequently appear in non-English languages. To address these limitations, we propose MULTITAT, the first multilingual TATQA dataset. Specifically, we sample data from 3 mainstream TATQA datasets and translate it into 10 diverse languages. To align models' English TATQA capabilities with other languages, we develop a baseline, Ours. Experimental results show that performance on non-English data in MULTITAT drops by an average of 19.4% compared to English, underscoring the need for MULTITAT. We further analyze the reasons for this performance gap. In addition, Ours outperforms the other baselines by an average of 3.3, demonstrating its effectiveness.
