Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation

Pius Horn; Janis Keuper

Abstract

Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM-based evaluation correlates substantially more strongly with human judgment (Pearson r=0.93) than Tree Edit Distance-based Similarity (TEDS, r=0.68) or Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task.

Code and data: https://github.com/phorn1/pdf-parse-bench
Metric study and human evaluation: https://github.com/phorn1/table-metric-study

Paper Structure (22 sections, 4 figures, 2 tables)

Introduction
Related Work
PDF Parsing Benchmarks
Table Extraction Evaluation Metrics
Methodology
Benchmark Dataset: Synthetic PDFs with Ground Truth
Evaluation Pipeline: Table Matching
Assessment of Table Evaluation Approaches
Limitations of Rule-based Metrics
LLM-as-a-Judge for Table Evaluation
Human Evaluation Protocol
Correlation with Human Judgment
Experiments and Results
Discussion
LLM scores vs. TEDS.
...and 7 more sections

Figures (4)

Figure 1: Overview of the benchmark generation pipeline. Randomly sampled content blocks and layout templates yield a JSON ground truth, which is assembled into LaTeX and compiled to PDF.
Figure 2: Structural metrics penalize harmless variation while overlooking critical errors. The parser output (b) largely preserves the semantics of (a), yet incurs heavy edit distance from representational differences (structural reorganization, symbol encoding, value equivalence, markup artifact). The only meaning-altering errors, a lost decimal and a sign flip (content error), barely affect the score.
Figure 3: Scatter plots comparing automated metrics with human scores. Left column: rule-based metrics (TEDS, GriTS-Avg, SCORE-Avg); right column: LLM judges (DeepSeek-v3.2, Gemini-3-Flash-Preview, Claude Opus 4.6). Bubble size indicates point count.
Figure 4: Per-parser score distributions across 451 tables. Each subplot shows the percentage of tables receiving each integer score (0 to 10); the dashed line marks the mean. Parsers are ordered by mean score (top-left to bottom-right).
