TabRAG: Tabular Document Retrieval via Structured Language Representations

Jacob Si, Mike Qu, Michelle Lee, Yingzhen Li

TL;DR

TabRAG addresses the difficulty of applying retrieval-augmented generation (RAG) to tabular documents: it avoids expensive embedding fine-tuning and instead improves parsing-based extraction. The pipeline first segments each page into layout regions, then uses a vision-language model to extract table cells, together with their headers, as structured JSON, and finally uses a large language model to convert that JSON into embedding-friendly natural language descriptions for retrieval. The approach bridges structured and unstructured modalities by generating region rationales and encoding them with a dedicated embedding backbone, which supports robust retrieval and better generation. Results on multiple tabular QA benchmarks show significant gains in generation accuracy and L3Score, with competitive retrieval performance and reasonable compute cost. The work offers a practical path to scaling tabular understanding in RAG without expensive fine-tuning, and the authors provide open-source code.

Abstract

Ingesting data for Retrieval-Augmented Generation (RAG) involves either fine-tuning the embedding model directly on the target corpus or parsing documents for embedding model encoding. The former, while accurate, incurs high computational hardware requirements, while the latter suffers from suboptimal performance when extracting tabular data. In this work, we address the latter by presenting TabRAG, a parsing-based RAG pipeline designed to tackle table-heavy documents via structured language representations. TabRAG outperforms existing popular parsing-based methods for generation and retrieval. Code is available at https://github.com/jacobyhsi/TabRAG.
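
To make the parsing-based route concrete, here is a minimal retrieval sketch: parsed chunks are encoded, and a query is matched against them by cosine similarity. Everything in it is an assumption for illustration. The `embed` function is a toy bag-of-words counter standing in for a real embedding model, and the chunk texts are made-up examples of the kind of natural language descriptions a parser might emit; none of this is taken from the TabRAG code.

```python
import math
import re
from collections import Counter
from typing import List

# Hypothetical parsed chunks (made-up data, not from the paper).
CHUNKS = [
    "In Table 1, the Revenue for 2022 is 10.",
    "In Table 1, the Profit for 2023 is 3.",
    "Annual report overview and notes.",
]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    print(retrieve("What was the profit in 2023?", CHUNKS, k=1))
```

Because each description names the table, row, and column alongside the value, the query shares terms with the right chunk even under this crude encoder; a real embedding model would replace `embed` without changing the rest of the flow.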

Paper Structure

This paper contains 17 sections, 1 equation, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: The TabRAG architecture, a parsing-based RAG pipeline designed specifically for tables. First, a layout detection model segments the documents into their components. The detected tables are then passed to a vision-language model, which extracts each cell value along with its corresponding column and row names in a structured representation. Lastly, the structured representation is fed to a language model that generates natural language descriptions.
  • Figure 2: TabRAG Framework
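
The three stages named in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `detect_layout`, `extract_cells`, and `describe_cells` are hypothetical stand-ins for the layout detection model, the vision-language model, and the language model respectively, and the page is a hand-written dictionary rather than a real document image.

```python
from typing import Dict, List

# Toy page: hypothetical data standing in for a real document page.
PAGE = {
    "regions": [
        {"kind": "text", "text": "Annual report overview."},
        {
            "kind": "table",
            "caption": "Table 1",
            "columns": ["Revenue", "Profit"],
            "rows": ["2022", "2023"],
            "values": [["10", "2"], ["12", "3"]],
        },
    ]
}


def detect_layout(page: Dict) -> List[Dict]:
    """Stand-in for the layout detection model: return the page's regions."""
    return page["regions"]


def extract_cells(table: Dict) -> List[Dict]:
    """Stand-in for the vision-language model.

    Pairs every cell value with its row and column names, giving a
    structured, JSON-like representation of the table.
    """
    cells = []
    for row_name, row_values in zip(table["rows"], table["values"]):
        for column_name, value in zip(table["columns"], row_values):
            cells.append({"row": row_name, "column": column_name, "value": value})
    return cells


def describe_cells(cells: List[Dict], caption: str) -> List[str]:
    """Stand-in for the language model: one templated sentence per cell."""
    return [
        f"In {caption}, the {cell['column']} for {cell['row']} is {cell['value']}."
        for cell in cells
    ]


def parse_page(page: Dict) -> List[str]:
    """Run the three stages and return text chunks ready for embedding."""
    chunks: List[str] = []
    for region in detect_layout(page):
        if region["kind"] == "table":
            chunks.extend(describe_cells(extract_cells(region), region["caption"]))
        else:
            chunks.append(region["text"])
    return chunks


if __name__ == "__main__":
    for chunk in parse_page(PAGE):
        print(chunk)
```

Keeping the row and column names attached to every cell is what lets the final descriptions stay self-contained: each sentence can be embedded and retrieved on its own without the surrounding table.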