TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning
Xiaohan Yu, Pu Jian, Chong Chen
TL;DR
TableRAG is an SQL-based retrieval-and-execution framework for reasoning over heterogeneous documents that contain both text and tables. It combines offline database construction with an online iterative reasoning loop of four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. To evaluate multi-hop heterogeneous reasoning, the authors propose HeteQA, a 304-example benchmark spanning nine domains and five tabular operations. Empirical results show state-of-the-art performance on HybridQA, WikiTableQuestion, and HeteQA, along with efficiency gains from fewer reasoning iterations and symbolic table reasoning. The work demonstrates the importance of integrating structured SQL-based reasoning with natural language retrieval for robust heterogeneous document QA.
Abstract
Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents comprising both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practices of table flattening and chunking disrupt the intrinsic tabular structure, lead to information loss, and undermine the reasoning capabilities of LLMs on multi-hop, global queries. To address these challenges, we propose TableRAG, an SQL-based framework that unifies textual understanding and complex manipulations over tabular data. TableRAG operates iteratively in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and our HeteQA, establishing a new state of the art for heterogeneous document question answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
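The four-step loop described above can be illustrated with a minimal sketch. This is not the released TableRAG code: every function name here (build_database, retrieve_text, answer, run_demo) is hypothetical, SQLite stands in for whatever database the authors use, a toy word-overlap ranker stands in for the text retriever, and simple rule-based stubs replace the LLM calls for decomposition, SQL writing, and answer composition. The data in the demo is invented for illustration.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

History = List[Tuple[str, str]]  # (sub-query, intermediate answer) pairs


def build_database(table: str, columns: List[str], rows: List[tuple]) -> sqlite3.Connection:
    """Offline step (assumed form): load a table into SQLite instead of flattening it."""
    conn = sqlite3.connect(":memory:")
    col_defs = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return conn


def retrieve_text(query: str, passages: List[str], k: int = 1) -> List[str]:
    """Toy retriever: rank passages by lowercase word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(passages, key=lambda p: len(q & set(p.lower().split())), reverse=True)
    return ranked[:k]


def answer(
    question: str,
    conn: sqlite3.Connection,
    passages: List[str],
    decompose: Callable[[str, History], Optional[str]],
    write_sql: Callable[[str, History], Optional[str]],
    compose: Callable[[str, List[str], list], str],
    max_iters: int = 5,
) -> str:
    """Online loop: decompose, retrieve text, run SQL, compose an intermediate answer."""
    history: History = []
    for _ in range(max_iters):
        sub_query = decompose(question, history)      # 1. context-sensitive decomposition
        if sub_query is None:                         # nothing left to resolve
            break
        texts = retrieve_text(sub_query, passages)    # 2. text retrieval
        sql = write_sql(sub_query, history)           # 3. SQL programming ...
        rows = conn.execute(sql).fetchall() if sql else []  # ... and execution
        history.append((sub_query, compose(sub_query, texts, rows)))  # 4. intermediate answer
    return history[-1][1] if history else ""


def run_demo() -> str:
    """Two-hop example with rule-based stubs in place of LLM calls."""
    conn = build_database(
        "films",
        ["title", "director", "gross"],
        [("Alpha", "Ann Lee", 120), ("Beta", "Bo Kim", 300), ("Gamma", "Cy Roe", 210)],
    )
    passages = ["Ann Lee was born in Leeds.", "Bo Kim was born in Busan."]

    def decompose(question: str, history: History) -> Optional[str]:
        if not history:
            return "Which director made the highest-grossing film?"
        if len(history) == 1:
            return f"Where was {history[0][1]} born?"
        return None

    def write_sql(sub_query: str, history: History) -> Optional[str]:
        # Only the first hop needs the table; the second hop is answered from text.
        if not history:
            return "SELECT director FROM films ORDER BY gross DESC LIMIT 1"
        return None

    def compose(sub_query: str, texts: List[str], rows: list) -> str:
        if rows:
            return str(rows[0][0])
        return texts[0] if texts else ""

    return answer(
        "Where was the director of the highest-grossing film born?",
        conn, passages, decompose, write_sql, compose,
    )
```

In this sketch the first hop is a global query over the table (a maximum over the gross column), which the database answers exactly through SQL, and the second hop is resolved from the retrieved passage. Calling run_demo() walks through both hops and returns the final intermediate answer.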
