TableRAG: Million-Token Table Understanding with Language Models

Si-An Chen, Lesly Miculicich, Julian Martin Eisenschlos, Zifeng Wang, Zilong Wang, Yanfei Chen, Yasuhisa Fujii, Hsuan-Tien Lin, Chen-Yu Lee, Tomas Pfister

TL;DR

TableRAG is a retrieval-augmented framework for large-scale table understanding. It combines tabular query expansion with schema and cell retrieval to drastically reduce prompt size while preserving reasoning capability. By encoding only a small, frequency-aware subset of schema and cell information and using a program-aided LM, it achieves state-of-the-art performance on million-token tables across ArcadeQA, BirdQA, and synthetic TabFact at lower token cost. The work also introduces two real-world million-token benchmarks and reports ablations showing the benefit of query expansion and the retrieval components over baseline strategies. Overall, TableRAG offers a practical path to LM-based table QA beyond conventional context-length limits, with potential applications in real-world data analytics and QA over tables too large to reason about directly.

Abstract

Recent advancements in language models (LMs) have notably enhanced their ability to reason with tabular data, primarily through program-aided mechanisms that manipulate and analyze tables. However, these methods often require the entire table as input, leading to scalability challenges due to positional bias or context-length constraints. In response to these challenges, we introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. This enables more efficient data encoding and precise retrieval, significantly reducing prompt lengths and mitigating information loss. We have developed two new million-token benchmarks from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's effectiveness at scale. Our results demonstrate that TableRAG's retrieval design achieves the highest retrieval quality, leading to new state-of-the-art performance on large-scale table understanding.
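
The retrieval flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`build_databases`, `top_k`, `retrieve`), the toy table, and the hand-written queries are all hypothetical, and a simple token-overlap score stands in for the embedding-based similarity and LM-generated query expansion that a real system would use.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (underscores included)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(query, doc):
    """Jaccard token overlap: a toy stand-in for embedding similarity (assumption)."""
    q, d = tokenize(query), tokenize(doc)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def top_k(query, docs, k):
    """Return up to k docs ranked by similarity, dropping zero-score matches."""
    ranked = sorted(docs, key=lambda doc: similarity(query, doc), reverse=True)
    return [doc for doc in ranked[:k] if similarity(query, doc) > 0]


def build_databases(table):
    """table maps column name -> list of cell values (hypothetical layout)."""
    # Schema database: one entry per column, with its name and data type.
    schema_db = [f"{col} ({type(vals[0]).__name__})" for col, vals in table.items()]
    # Cell database: one entry per distinct column-value pair.
    cell_db = sorted({f"{col}: {val}" for col, vals in table.items() for val in vals})
    return schema_db, cell_db


def retrieve(queries, database, k):
    """Combine the top-k hits of every query, keeping first-seen order."""
    hits = []
    for query in queries:
        for doc in top_k(query, database, k):
            if doc not in hits:
                hits.append(doc)
    return hits


table = {
    "product_name": ["wallet", "wallet", "bag", "belt"],
    "price": [12.5, 12.5, 40.0, 9.0],
    "store_city": ["Oslo", "Oslo", "Rome", "Oslo"],
}
question = "What is the average price of wallets?"

# In TableRAG these queries are generated by an LM; here they are hand-written.
schema_queries = ["product name", "price"]
cell_queries = ["wallet"]

schema_db, cell_db = build_databases(table)
schemas = retrieve(schema_queries, schema_db, k=2)
cells = retrieve(cell_queries, cell_db, k=2)

# Only the retrieved schema entries and cells reach the solver's prompt.
prompt = "\n".join([
    "Schema: " + "; ".join(schemas),
    "Cells: " + "; ".join(cells),
    "Question: " + question,
])
print(prompt)
```

In this sketch the prompt size depends on $K$ and the number of queries rather than on the number of rows in the table, which is the property the abstract attributes to schema and cell retrieval.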

Paper Structure

This paper contains 41 sections, 11 figures, 7 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Comparison between table prompting techniques for LMs. (a) - (d): Data included in the LM prompt (shaded region). (a) Read Table: The LM reads the entire table, which is often infeasible for large tables. (b) Read Schema: The LM reads only the schema, consisting of column names and data types, resulting in a loss of information from the table content. (c) Row-Column Retrieval: Rows and columns are encoded and then selected based on their similarity to the question. Only the intersection of these rows and columns is presented to the LM. It is still infeasible to encode all rows and columns for large tables. (d) Schema-Cell Retrieval (our work): Column names and cells are encoded and retrieved based on their relevance to LM-generated queries about the question. Only the retrieved schema and cells are provided to the LM, enhancing efficiency in both encoding and reasoning. (e) Retrieval results on the ArcadeQA dataset show that TableRAG outperforms other methods in both column and cell retrieval, thereby enhancing the subsequent table reasoning process. The Read Table technique is excluded as reading entire tables is typically infeasible in this context.
  • Figure 2: Workflow of the TableRAG Framework. The table is utilized to build the schema and cell databases. A question is then expanded into multiple schema and cell queries by LMs. These queries are sequentially utilized to retrieve schemas and column-cell pairs. The top $K$ candidates from each query are combined and fed into the LM solver's prompt to answer the question. The pseudocode and an answering example on ArcadeQA can be found in Alg. \ref{alg:tablerag} and Fig. \ref{fig:example}, respectively.
  • Figure 3: Histogram of the ratio of the number of distinct values to the number of cells in ArcadeQA and BirdQA. The figure indicates that for most tables, the number of distinct values ($D$) is much smaller than the number of cells ($NM$).
  • Figure 4: Performance evaluation on synthetic TabFact at varying scales. TableRAG shows consistently superior results, and its performance degrades more gracefully than that of competing methods.
  • Figure 5: Impact of varying top retrieval results ($K$). Different $K$ values influence both prompt length and accuracy. Each point is labeled with its corresponding $K$ value. TableRAG retrieves the top $K$ schema and cell values, RandRowSampling selects $K$ random rows, and RowColRetrieval retrieves $K$ rows and $K$ columns.
  • ...and 6 more figures
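
Figure 3's observation that $D$ is much smaller than $NM$ is what makes a frequency-aware cell database cheap to build. The following is a minimal sketch under stated assumptions, not the paper's code: the function name `build_cell_database`, the `budget` parameter, and the toy table are hypothetical, and keeping the most frequent distinct column-value pairs is one plausible reading of "frequency-aware".

```python
from collections import Counter


def build_cell_database(table, budget):
    """Keep at most `budget` distinct (column, value) pairs, most frequent first.

    table maps column name -> list of cell values (hypothetical layout).
    Returns (kept_pairs, number_of_distinct_pairs, number_of_cells).
    """
    # Count how often each distinct column-value pair occurs in the table.
    counts = Counter((col, val) for col, vals in table.items() for val in vals)
    total_cells = sum(len(vals) for vals in table.values())
    # Encode only the most frequent pairs when the budget is exceeded.
    kept = [pair for pair, _ in counts.most_common(budget)]
    return kept, len(counts), total_cells


table = {
    "city": ["Oslo", "Oslo", "Oslo", "Rome", "Rome", "Lima"],
    "status": ["open", "open", "open", "open", "closed", "closed"],
}

kept, distinct, total = build_cell_database(table, budget=2)
print(f"distinct pairs: {distinct}, cells: {total}")
print("encoded pairs:", kept)
```

Here the table has 12 cells but only 5 distinct column-value pairs, and the budget of 2 keeps the two most frequent ones, so the number of entries to encode is bounded by the budget instead of growing with the table.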