ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction

Nick Ferguson; Josh Pennington; Narek Beghian; Aravind Mohan; Douwe Kiela; Sheshansh Agrawal; Thien Hang Nguyen

TL;DR

ExtractBench targets end-to-end PDF-to-JSON structured extraction under enterprise-scale schemas. It contributes two things: a diagnostic dataset of PDF–Schema–gold JSON triplets across five domains, and a schema-driven evaluation framework that captures per-field semantics, three-valued value states, and nested-array alignment. The framework treats the JSON Schema as an executable specification and provides a plugin-based metric library for exact, fuzzy, semantic, and numeric-tolerance evaluation, along with semantic array matching. Benchmark results across frontier LLMs show that reliability degrades sharply as schema breadth increases: the 369-field SEC 10-K/Q schema yields 0% valid outputs. Even when the output is valid, field-level accuracy reaches only 65–80%, and the overall end-to-end pass rate is low. The work identifies output volume, rather than input length or nesting depth, as the dominant predictor of failure. It releases open-source tooling and points to dataset diversification, retrieval or decomposition strategies, and agentic approaches as directions toward production-ready enterprise extraction.
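To make the idea of array alignment concrete, the following is a minimal sketch, not the ExtractBench implementation. It assumes a greedy one-to-one matching and uses character-level similarity from Python's `difflib` as a crude stand-in for semantic matching. The function names `similarity` and `align_arrays` and the 0.6 threshold are hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in for a semantic score)."""
    return SequenceMatcher(None, str(a).casefold(), str(b).casefold()).ratio()


def align_arrays(gold, pred, threshold=0.6):
    """Greedily pair gold and predicted items, best-scoring pairs first.

    Returns (pairs, missing, extra):
      pairs   - sorted list of (gold_index, pred_index) matches
      missing - gold indices with no match (omitted items)
      extra   - predicted indices with no match (possibly hallucinated items)
    The threshold is an assumed cutoff, not a value from the paper.
    """
    candidates = sorted(
        (
            (similarity(g, p), i, j)
            for i, g in enumerate(gold)
            for j, p in enumerate(pred)
        ),
        reverse=True,
    )
    used_gold, used_pred, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining candidates are all weaker
        if i in used_gold or j in used_pred:
            continue  # each item may be matched at most once
        used_gold.add(i)
        used_pred.add(j)
        pairs.append((i, j))
    missing = [i for i in range(len(gold)) if i not in used_gold]
    extra = [j for j in range(len(pred)) if j not in used_pred]
    return sorted(pairs), missing, extra


gold_titles = [
    "Attention Is All You Need",
    "Deep Residual Learning for Image Recognition",
]
pred_titles = [
    "deep residual learning for image recognition",
    "Attention is all you need",
    "A Made-Up Paper Title",
]
pairs, missing, extra = align_arrays(gold_titles, pred_titles)
```

Aligning before scoring means a reordered but otherwise correct array is not penalized position by position, while unmatched items on either side are still reported separately.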

Abstract

Unstructured documents like PDFs contain valuable structured information, but downstream systems require this data in reliable, standardized formats. LLMs are increasingly deployed to automate this extraction, making accuracy and reliability paramount. However, progress is bottlenecked by two gaps. First, no end-to-end benchmark evaluates PDF-to-JSON extraction under enterprise-scale schema breadth. Second, no principled methodology captures the semantics of nested extraction, where fields demand different notions of correctness (exact match for identifiers, tolerance for quantities, semantic equivalence for names), arrays require alignment, and omission must be distinguished from hallucination. We address both gaps with ExtractBench, an open-source benchmark and evaluation framework for PDF-to-JSON structured extraction. The benchmark pairs 35 PDF documents with JSON Schemas and human-annotated gold labels across economically valuable domains, yielding 12,867 evaluatable fields spanning schema complexities from tens to hundreds of fields. The evaluation framework treats the schema as an executable specification: each field declares its scoring metric. Baseline evaluations reveal that frontier models (GPT-5/5.2, Gemini-3 Flash/Pro, Claude 4.5 Opus/Sonnet) remain unreliable on realistic schemas. Performance degrades sharply with schema breadth, culminating in 0% valid output on a 369-field financial reporting schema across all tested models. We release ExtractBench at https://github.com/ContextualAI/extract-bench.
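As an illustration of "the schema as an executable specification", here is a minimal sketch of how a per-field metric declaration might drive scoring. It is written under assumptions and does not reproduce the ExtractBench API: the `x-metric` annotation key, the metric names, the field names, and the outcome labels are all hypothetical, and a missing key is treated the same as `null` for simplicity.

```python
import math


def exact(gold, pred, **_):
    """Strict equality, suited to identifiers."""
    return gold == pred


def numeric_tolerance(gold, pred, rel_tol=0.01, **_):
    """Relative-tolerance comparison, suited to quantities."""
    try:
        return math.isclose(float(gold), float(pred), rel_tol=rel_tol)
    except (TypeError, ValueError):
        return False  # non-numeric prediction cannot match


def casefold_match(gold, pred, **_):
    """Case- and whitespace-insensitive match (crude stand-in for semantic equivalence)."""
    def norm(value):
        return " ".join(str(value).casefold().split())
    return norm(gold) == norm(pred)


# Hypothetical metric registry; the names are illustrative only.
METRICS = {
    "exact": exact,
    "numeric_tolerance": numeric_tolerance,
    "casefold": casefold_match,
}

# "x-metric" is an assumed annotation key, not one confirmed by the source.
SCHEMA = {
    "type": "object",
    "properties": {
        "cusip": {"type": "string", "x-metric": {"name": "exact"}},
        "revenue": {
            "type": "number",
            "x-metric": {"name": "numeric_tolerance", "rel_tol": 0.01},
        },
        "issuer": {"type": "string", "x-metric": {"name": "casefold"}},
        "ceo": {"type": "string", "x-metric": {"name": "casefold"}},
    },
}


def score_field(spec, gold, pred):
    """Classify one field, separating omission from hallucination."""
    if gold is None and pred is None:
        return "correct_absent"
    if gold is None:
        return "hallucination"  # model produced a value the gold label lacks
    if pred is None:
        return "omission"  # model dropped a value the gold label has
    config = dict(spec.get("x-metric", {"name": "exact"}))
    metric = METRICS[config.pop("name")]
    return "correct" if metric(gold, pred, **config) else "incorrect"


def evaluate(schema, gold, pred):
    """Score every top-level property with the metric its schema entry declares."""
    return {
        name: score_field(spec, gold.get(name), pred.get(name))
        for name, spec in schema["properties"].items()
    }


gold_doc = {"cusip": "037833100", "revenue": 1000.0, "issuer": "Acme Corp", "ceo": None}
pred_doc = {"cusip": "037833100", "revenue": 1005.0, "issuer": "ACME  corp", "ceo": "Jane Doe"}
report = evaluate(SCHEMA, gold_doc, pred_doc)
```

Keeping the metric choice inside the schema means the same document pair can be rescored under a different notion of correctness by editing the schema alone, without touching the scoring code.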

Paper Structure (38 sections, 2 figures, 13 tables)

This paper contains 38 sections, 2 figures, 13 tables.

Introduction
Related Work
Document understanding and enterprise document IE benchmarks
Structured output generation and constrained decoding
Nested JSON extraction and evaluation methodology
PDF processing and upstream document parsing
The ExtractBench Dataset
Design Philosophy
Domains and Complexity Dimensions
Schema complexity dimensions.
Annotation Process
A Principled Evaluation Methodology
The Evaluation Challenge
Schema-Driven Evaluation
Metric library.
...and 23 more sections

Figures (2)

Figure 1: Overview of the financial data schema structure showing (a) main sections, (b) key metric fields, (c) JSON schema reference example, (d) growth metric definition with 17 fields, and (e) leaf type definitions.
Figure 2: Token statistics by domain. Left: average input (PDF) and output (JSON) token counts. Right: compression ratio (input ÷ output). Credit agreements show 80× compression (needle-in-haystack extraction from long documents), while research papers show <1× (output exceeds input due to citation array expansion).
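The compression ratio in Figure 2 is input tokens divided by output tokens. A small sketch of the arithmetic follows; the token counts are made-up round numbers chosen only to reproduce the two regimes the caption describes, not measurements from the paper.

```python
def compression_ratio(input_tokens, output_tokens):
    """Return input tokens divided by output tokens."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return input_tokens / output_tokens


# Illustrative counts only (assumed, not taken from the paper).
long_doc_ratio = compression_ratio(80_000, 1_000)     # long input, short output
citation_heavy_ratio = compression_ratio(9_000, 12_000)  # output longer than input
```

A ratio well above 1 corresponds to needle-in-haystack extraction, while a ratio below 1 means the model must generate more tokens than it reads.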
