TabRAG: Tabular Document Retrieval via Structured Language Representations
Jacob Si, Mike Qu, Michelle Lee, Yingzhen Li
TL;DR
TabRAG addresses the difficulty of applying retrieval-augmented generation to tabular documents without resorting to expensive embedding fine-tuning, by improving parsing-based extraction instead. Its pipeline first segments each page into layout regions, then uses a vision-language model to extract structured JSON representations of table cells together with their headers, and finally uses a large language model to convert these into embedding-friendly natural-language descriptions for retrieval. The approach bridges structured and unstructured modalities by generating region rationales and encoding them with a dedicated embedding backbone, which supports robust retrieval and better generation. Empirical results on multiple tabular QA benchmarks show significant gains in generation accuracy and L3Score, with competitive retrieval performance and reasonable compute cost. The work demonstrates a practical path to scaling tabular understanding in RAG without expensive fine-tuning, and the authors provide open-source code.
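To make the structured-to-text step above concrete, here is a minimal sketch of turning a JSON table region into a description that a text embedding model could encode. It is not the authors' implementation: the JSON field names (`table_title`, `cells`, `row_header`, `column_header`, `value`) and the function `cells_to_description` are hypothetical, and a fixed sentence template stands in for the large language model that TabRAG uses for this conversion.

```python
import json

# Hypothetical structured output for one table region. Each cell keeps its
# row header, column header, and value; the field names are assumptions.
region_json = json.loads("""
{
  "table_title": "Revenue by segment",
  "cells": [
    {"row_header": "Cloud", "column_header": "2023", "value": "4.2"},
    {"row_header": "Cloud", "column_header": "2024", "value": "5.1"}
  ]
}
""")


def cells_to_description(region: dict) -> str:
    """Template-based stand-in for the LLM step that verbalizes a region."""
    title = region.get("table_title", "the table")
    sentences = [
        f"In {title}, the value for {cell['row_header']} "
        f"under {cell['column_header']} is {cell['value']}."
        for cell in region["cells"]
    ]
    return " ".join(sentences)


description = cells_to_description(region_json)
print(description)
```

Because every sentence names both headers alongside the value, the resulting text keeps the cell-to-header association that is lost when a table is flattened into raw text before embedding.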
Abstract
Ingesting data for Retrieval-Augmented Generation (RAG) involves either fine-tuning the embedding model directly on the target corpus or parsing documents so that an embedding model can encode them. The former, while accurate, demands substantial computational hardware, whereas the latter performs poorly when extracting tabular data. In this work, we address the latter by presenting TabRAG, a parsing-based RAG pipeline designed to handle table-heavy documents via structured language representations. TabRAG outperforms existing popular parsing-based methods for generation and retrieval. Code is available at https://github.com/jacobyhsi/TabRAG.
