TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning

Xiaohan Yu, Pu Jian, Chong Chen

TL;DR

TableRAG introduces an SQL-based retrieval-and-execution framework to address reasoning over heterogeneous documents containing text and tables. It combines offline database construction with an online iterative reasoning loop consisting of context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. To evaluate multi-hop heterogeneous reasoning, the authors propose HeteQA, a 304-example benchmark across nine domains with five tabular operations. Empirical results show state-of-the-art performance on HybridQA, WikiTableQuestion, and HeteQA, supported by efficiency gains from reduced iterations and symbolic table reasoning. This work demonstrates the importance of integrating structured SQL-based reasoning with natural language retrieval for robust heterogeneous document QA.

Abstract

Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents, comprising both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practice of flattening tables and then chunking them disrupts the intrinsic tabular structure, leads to information loss, and undermines the reasoning capabilities of LLMs on multi-hop, global queries. To address these challenges, we propose TableRAG, an SQL-based framework that unifies textual understanding and complex manipulations over tabular data. TableRAG operates iteratively in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and our HeteQA, establishing a new state of the art for heterogeneous document question answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
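
The four-step loop described in the abstract can be illustrated with a minimal sketch. This is not the released TableRAG code: the table `films`, the passages, the hand-written `plan` (standing in for LLM-driven query decomposition and SQL generation), and the helper names `build_database`, `retrieve_text`, and `answer` are all hypothetical, and the text retriever is a toy word-overlap ranker. The sketch only shows why executing SQL against an intact table can answer a global question (here, a maximum over a column) that a flattened, chunked table may not support reliably.

```python
import re
import sqlite3
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    subquery: str
    sql: Optional[str]
    answer: str


def build_database(rows: List[Tuple[str, int, float]]) -> sqlite3.Connection:
    """Offline stage (assumed): load the table into a relational store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE films (title TEXT, year INTEGER, gross REAL)")
    conn.executemany("INSERT INTO films VALUES (?, ?, ?)", rows)
    return conn


def tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def retrieve_text(subquery: str, passages: List[str], k: int = 1) -> List[str]:
    """Toy retriever: rank passages by word overlap with the subquery."""
    q = tokens(subquery)
    ranked = sorted(passages, key=lambda p: len(q & tokens(p)), reverse=True)
    return ranked[:k]


def answer(plan, conn: sqlite3.Connection, passages: List[str]) -> List[Step]:
    """Online loop: each step sees the previous intermediate answer."""
    history: List[Step] = []
    for subquery_template, sql in plan:
        prev = history[-1].answer if history else ""
        # Context-sensitive decomposition: fill in the prior answer.
        subquery = subquery_template.format(prev=prev)
        if sql is not None:
            # SQL programming and execution over the intact table.
            rows = conn.execute(sql).fetchall()
            result = str(rows[0][0]) if rows else ""
        else:
            # Text retrieval over the unstructured passages.
            result = retrieve_text(subquery, passages)[0]
        history.append(Step(subquery, sql, result))
    return history


conn = build_database([("Alpha", 2001, 10.0), ("Beta", 2005, 42.5), ("Gamma", 2010, 30.0)])
passages = [
    "Alpha was directed by Ann Lee.",
    "Beta was directed by Bo Chen.",
    "Gamma was directed by Cy Diaz.",
]
# Hand-written plan for "Who directed the highest-grossing film?";
# in a real system an LLM would produce each step from the history.
plan = [
    ("Which film has the highest gross?",
     "SELECT title FROM films ORDER BY gross DESC LIMIT 1"),
    ("Who directed {prev}?", None),
]
steps = answer(plan, conn, passages)
for step in steps:
    print(step.subquery, "->", step.answer)
```

The first step resolves the tabular hop symbolically, and its intermediate answer is substituted into the second subquery before text retrieval, which mirrors the compositional intermediate answer generation named in the abstract.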

Paper Structure

This paper contains 57 sections, 5 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: An example of the heterogeneous document based question answering task.
  • Figure 2:
  • Figure 3: Domain distribution and tabular operation distribution of HeteQA.
  • Figure 4: Ablation study on HybridQA and HeteQA benchmarks based on DeepSeek-V3 backbone.
  • Figure 5: Comparison of the execution iterations on HeteQA between TableRAG, ReAct and TableGPT2.
  • ...and 6 more figures