Benchmarking Table Extraction from Heterogeneous Scientific Documents

Marijan Soric, Cécile Gracianne, Ioana Manolescu, Pierre Senellart

TL;DR

The paper tackles end-to-end table extraction (TE) from PDFs by introducing a formal benchmark, two new heterogeneous datasets, and end-to-end evaluation metrics that jointly assess table detection (TD) and table structure recognition (TSR). It analyzes a spectrum of methods, from rule-based libraries to transformer-based DETR-like pipelines and large vision-language models, across three datasets: PubTables, Table-arXiv, and Table-BRGM. Key findings show that TE remains challenging because of generalization gaps, limited robustness, and imperfect token-level content accuracy, with object-detection-based approaches such as DocLing and the proposed TATR/VGT pipelines delivering the strongest overall performance. The work provides a comprehensive, publicly available benchmark to drive reproducible TE research and emphasizes calibrated confidence estimates as a practical requirement for real-world deployment.

Abstract

Table Extraction (TE) consists in extracting tables from PDF documents, in a structured format which can be automatically processed. While numerous TE tools exist, the variety of methods and techniques makes it difficult for users to choose an appropriate one. We propose a novel benchmark for assessing end-to-end TE methods (from PDF to the final table). We contribute an analysis of TE evaluation metrics, and the design of a rigorous evaluation process, which allows scoring each TE sub-task as well as end-to-end TE, and captures model uncertainty. Along with a prior dataset, our benchmark comprises two new heterogeneous datasets of 37k samples. We run our benchmark on diverse models, including off-the-shelf libraries, software tools, large vision language models, and approaches based on computer vision. The results demonstrate that TE remains challenging: current methods suffer from a lack of generalizability when facing heterogeneous data, and from limitations in robustness and interpretability.

Paper Structure

This paper contains 56 sections, 12 equations, 17 figures, 8 tables.

Figures (17)

  • Figure 1: Sample table where $(i,j)$ pairs indicate the location of each cell. When omitted, default values for $r$ and $c$ are $0$.
  • Figure 2: TATR-extract pipeline for table extraction
  • Figure 3: TSR inputs depending on the TD model type (with or without a confidence score), and evaluation type (bbox-based or content-based).
  • Figure 4: Precision-Recall curves, depending on IoU (50% or expected metric), for (left to right) the PubTables, Table-arXiv, and Table-BRGM datasets (bbox TD).
  • Figure 8: $P^{\rm Top}$-$R^{\rm Top}$ curves for (left to right) the PubTables, Table-arXiv, and Table-BRGM datasets (bbox TD).
  • ...and 12 more figures

Theorems & Definitions (2)

  • Definition 1: Expected Precision and Recall
  • Definition 2: TSR Precision and Recall