Efficient Table Retrieval and Understanding with Multimodal Large Language Models

Zhuoyan Xu, Haoyang Fang, Boran Han, Bonan Min, Bernie Wang, Cuixiong Hu, Shuai Zhang

TL;DR

This paper tackles the practical challenge of answering questions over large collections of table images, bypassing OCR by operating directly on visual table data. It introduces TabRAG, a three-stage pipeline consisting of a bi-encoder retriever, a cross-encoder MLLM reranker, and a generation-capable MLLM, optimized end-to-end with contrastive and instruction-tuning objectives. Evaluated on an MMTab-derived dataset with 88,161 training and 9,819 testing samples across eight benchmarks (48,504 unique tables), TabRAG achieves a 7.0% gain in retrieval recall and a 6.1% gain in answer accuracy over strong baselines, demonstrating robust improvements in retrieval, ranking, and generation. The work offers a practical, scalable solution for real-world table understanding tasks and suggests promising directions for extending to diverse document types and multimodal content beyond tables.

Abstract

Tabular data is frequently captured in image form across a wide range of real-world scenarios such as financial reports, handwritten records, and document scans. These visual representations pose unique challenges for machine understanding, as they combine both structural and visual complexities. While recent advances in Multimodal Large Language Models (MLLMs) show promising results in table understanding, they typically assume the relevant table is readily available. However, a more practical scenario involves identifying and reasoning over relevant tables from large-scale collections to answer user queries. To address this gap, we propose TabRAG, a framework that enables MLLMs to answer queries over large collections of table images. Our approach first retrieves candidate tables using jointly trained visual-text foundation models, then leverages MLLMs to perform fine-grained reranking of these candidates, and finally employs MLLMs to reason over the selected tables for answer generation. Through extensive experiments on a newly constructed dataset comprising 88,161 training and 9,819 testing samples across 8 benchmarks with 48,504 unique tables, we demonstrate that our framework significantly outperforms existing methods by 7.0% in retrieval recall and 6.1% in answer accuracy, offering a practical solution for real-world table understanding tasks.
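
The retrieve, rerank, then generate flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings, the pair scorer, and the generator below are toy stand-ins, and every name (`retrieve`, `rerank`, `answer_query`, `toy_score_pair`, `toy_generate`) is hypothetical. In a real system the vectors would come from a vision-text encoder and the last two stages would call an MLLM.

```python
from math import sqrt
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Vector, table_vecs: Dict[str, Vector], k: int) -> List[str]:
    """Stage 1 (bi-encoder style): rank precomputed table embeddings by similarity."""
    ranked = sorted(
        table_vecs,
        key=lambda tid: cosine(query_vec, table_vecs[tid]),
        reverse=True,
    )
    return ranked[:k]


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    n: int,
) -> List[str]:
    """Stage 2 (cross-encoder style): score each (query, table) pair jointly."""
    ranked = sorted(candidates, key=lambda tid: score_pair(query, tid), reverse=True)
    return ranked[:n]


def answer_query(
    query: str,
    query_vec: Vector,
    table_vecs: Dict[str, Vector],
    score_pair: Callable[[str, str], float],
    generate: Callable[[str, List[str]], str],
    k: int = 2,
    n: int = 1,
) -> str:
    """Run the three stages in order: retrieve k, keep the best n, then generate."""
    candidates = retrieve(query_vec, table_vecs, k)
    selected = rerank(query, candidates, score_pair, n)
    return generate(query, selected)


# Toy stand-ins (assumptions for illustration only).
toy_table_vecs = {"t1": [1.0, 0.0], "t2": [0.9, 0.1], "t3": [0.0, 1.0]}
toy_pair_scores = {"t1": 0.2, "t2": 0.9, "t3": 0.1}


def toy_score_pair(query: str, table_id: str) -> float:
    """Placeholder for a reranker that reads the query and table image together."""
    return toy_pair_scores[table_id]


def toy_generate(query: str, table_ids: List[str]) -> str:
    """Placeholder for the answer-generating model."""
    return f"answer to {query!r} using tables {table_ids}"


result = answer_query(
    "What was revenue in 2020?", [1.0, 0.05],
    toy_table_vecs, toy_score_pair, toy_generate,
)
print(result)
```

In this toy run the first stage ranks `t1` above `t2`, while the pairwise scorer prefers `t2`, so `t2` is the table passed to the generator. The split reflects the usual trade-off: the first stage compares a query vector against precomputed table vectors and is cheap per table, whereas the second stage scores each pair jointly and is therefore applied only to the short candidate list.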

Paper Structure (26 sections, 1 equation, 3 figures, 7 tables)

Figures (3)

  • Figure 1: The TabRAG framework, which consists of a retriever, a reranker, and an MLLM. Given a query, the retriever identifies relevant tables from the collection. The reranker then scores the relevance of each retrieved table image to the query and selects the best ones. Finally, the MLLM takes the selected images and the query as input and generates the final answer.
  • Figure 2: Retrieval results of different encoders on the RAGTab dataset; each curve represents a different model. The graphs show Mean Reciprocal Rank (MRR) and Recall across top-$k$ retrievals, with $k$ ranging from 1 to 200. (a) MRR on the top-$k$ results. (b) Recall on the top-$k$ results.
  • Figure 3: Retrieval results of different encoders on the RAGTab dataset; the curves compare the retrieval-only stage with retrieval followed by reranking. (a) MRR on the top-$k$ results. (b) Recall on the top-$k$ results.
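
The two metrics plotted in Figures 2 and 3 can be computed as in the sketch below. It assumes each query has exactly one gold table, which is an assumption made here for illustration and may not match the paper's exact evaluation protocol. The function names and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def recall_at_k(ranked_ids: List[str], gold_id: str, k: int) -> float:
    """1.0 if the gold table appears among the top-k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def mrr_at_k(ranked_ids: List[str], gold_id: str, k: int) -> float:
    """Reciprocal of the gold table's rank within the top k, or 0.0 if absent."""
    for rank, table_id in enumerate(ranked_ids[:k], start=1):
        if table_id == gold_id:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[List[str], str]], k: int) -> Dict[str, float]:
    """Average both metrics over (ranked_ids, gold_id) pairs, one pair per query."""
    n = len(runs)
    return {
        "recall": sum(recall_at_k(r, g, k) for r, g in runs) / n,
        "mrr": sum(mrr_at_k(r, g, k) for r, g in runs) / n,
    }


# Four toy queries, each with gold table "a" (illustrative data only).
runs = [
    (["a", "b", "c"], "a"),  # gold at rank 1
    (["b", "a", "c"], "a"),  # gold at rank 2
    (["b", "c", "a"], "a"),  # gold at rank 3
    (["b", "c", "d"], "a"),  # gold not retrieved
]

for k in (1, 2, 3):
    print(k, evaluate(runs, k))
```

With this toy data, Recall rises from 0.25 at $k=1$ to 0.75 at $k=3$, while MRR grows more slowly because hits at lower ranks contribute only $1/\text{rank}$. The same functions can score a retrieval-only ranking and a reranked one, which is the comparison Figure 3 describes.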