VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

Udi Barzelay; Ophir Azulai; Inbar Shapira; Idan Friedman; Foad Abo Dahood; Madison Lee; Abraham Daniels

Abstract

We introduce VAREX (VARied-schema EXtraction), a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. VAREX employs a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth that is validated through three-phase quality assurance. The benchmark comprises 1,777 documents with 1,771 unique schemas across three structural categories. Each document is provided in four input modalities: plain text, layout-preserving text (whitespace-aligned to approximate column positions), document image, and text and image combined. Existing benchmarks evaluate from a single input representation; VAREX's four controlled modalities per document enable systematic ablation of how input format affects extraction accuracy. We evaluate 20 models, ranging from frontier proprietary models to small open models, with particular attention to models of at most 4B parameters, which are suitable for cost-sensitive and latency-constrained deployment. Results reveal that (1) below 4B parameters, structured output compliance, not extraction capability, is a dominant bottleneck; in particular, schema echo (models producing schema-conforming structure instead of extracted values) depresses scores by 45-65 pp (percentage points) in affected models; (2) extraction-specific fine-tuning at 2B yields +81 pp gains, demonstrating that the instruction-following deficit is addressable without scale; (3) layout-preserving text provides the largest accuracy gain (+3-18 pp), exceeding the gain from pixel-level visual cues; and (4) the benchmark most effectively discriminates models in the 60-95% accuracy band. Dataset and evaluation code are publicly available.
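To make the distinction between extraction errors and schema echo concrete, the following is a minimal sketch, not the benchmark's released evaluation code. It assumes a field-level exact-match score computed over flattened JSON leaves and a simple heuristic for spotting schema echo; the function names (`flatten`, `field_em`, `looks_like_schema_echo`) and the string-normalization rule are illustrative assumptions.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {dotted.path: leaf_value} pairs."""
    if isinstance(obj, dict):
        out = {}
        for key, value in obj.items():
            out.update(flatten(value, f"{prefix}{key}."))
        return out
    if isinstance(obj, list):
        out = {}
        for index, value in enumerate(obj):
            out.update(flatten(value, f"{prefix}{index}."))
        return out
    return {prefix.rstrip("."): obj}


def field_em(pred, gold):
    """Fraction of ground-truth leaf fields whose predicted value matches exactly.

    Assumption: values are compared as whitespace-stripped strings.
    """
    gold_flat = flatten(gold)
    pred_flat = flatten(pred)
    if not gold_flat:
        return 0.0
    hits = sum(
        1
        for path, value in gold_flat.items()
        if str(pred_flat.get(path, "")).strip() == str(value).strip()
    )
    return hits / len(gold_flat)


def looks_like_schema_echo(pred):
    """Heuristic (assumed): the output is itself a JSON Schema object, not values."""
    return (
        isinstance(pred, dict)
        and pred.get("type") == "object"
        and "properties" in pred
    )
```

Under this sketch, a model that returns the schema itself scores zero on every field even if it could read the document, which is why compliance failures and extraction failures need to be reported separately.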

Paper Structure (17 sections, 11 figures, 5 tables)

This paper contains 17 sections, 11 figures, 5 tables.

Introduction
Related Work
The VAREX Benchmark
Reverse Annotation Pipeline
Dataset Composition
Quality Assurance
Evaluation Protocol
Metrics
Results and Analysis
Main Results
Structure-Aware Difficulty
Modality Analysis
Output Compliance in Small Models
Scaling, Fine-Tuning, and Modality Preference
Resolution Robustness
...and 2 more sections

Figures (11)

Figure 1: VAREX benchmark overview. A government form (a) is paired with a per-document JSON schema (b) that defines the extraction target, including nested structures via $ref. Forms are programmatically filled with realistic data, and ground truth (c) is derived directly from the fill values. The benchmark spans 1,777 documents with 1,771 unique schemas.
Figure 2: The Reverse Annotation pipeline. Stage 1: Fillable PDF templates are filled with deterministic placeholders (TXT_001, TXT_002, …). Stage 2: An LLM discovers a semantic schema by mapping placeholders to field names. Stage 3: Realistic synthetic values replace placeholders and are injected into form widgets. Stage 4: Each filled document is exported in four modalities.
Figure 3: Dataset and evaluation overview. (a) Distribution of extraction fields per document (median 11). (b) Field-level EM% by semantic category for four representative models on Image (V); fields are grouped into nine categories by keyword matching (e.g., Name includes applicant_name, witness_name, etc.), covering 74% of 21,084 fields; email and monetary values show the widest cross-scale gaps (15-17 pp). (c) Number of vision models (out of 18) achieving perfect extraction per document; 91 documents (5%) receive imperfect scores from all models, largely attributable to residual annotation noise (see Main Results).
Figure 4: Output compliance failures in small models (Image V). Dark bars: actual EM; hatched bars: EM on compliant documents. Failures include schema reproduction (dominant in InternVL3.5 1B) and schema-wrapped extraction (dominant in Qwen3-VL 2B). Qwen3-VL 2B scores 91.5% on compliant documents but 34.2% overall; InternVL3.5 1B scores 72.7% on compliant documents but 28.2% overall. NuExtract and h2oVL show no gap.
Figure 5: Example VAREX document (Nested category). The schema uses $defs/$ref to define reusable nested object types (HousingProvider, IndividualInCharge, HousingFacilityMailingAddress). Only the English-language fields contain fillable widgets; the Spanish translation serves as static context.
...and 6 more figures
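
The captions for Figures 1 and 5 mention schemas that reuse nested object types through $defs/$ref. As an illustration only, the sketch below inlines local references of the form `#/$defs/Name` so that a nested schema can be read as a single tree. The helper name `resolve_refs` and the example fields are hypothetical; only the type name HousingProvider comes from the caption. The sketch assumes references are local and non-cyclic, ignores JSON Pointer escape sequences, and drops any keywords that sit alongside a `$ref`.

```python
def resolve_refs(node, root):
    """Recursively inline local '#/...' references found in a JSON schema.

    Assumptions: references are local and acyclic, and reference paths
    contain no JSON Pointer escapes (~0, ~1).
    """
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith("#/"):
            target = root
            for part in ref[2:].split("/"):
                target = target[part]  # walk the path, e.g. $defs -> HousingProvider
            return resolve_refs(target, root)
        # Drop the $defs block from the output once references are inlined.
        return {k: resolve_refs(v, root) for k, v in node.items() if k != "$defs"}
    if isinstance(node, list):
        return [resolve_refs(item, root) for item in node]
    return node


# Hypothetical schema fragment; field names are illustrative, not from the dataset.
schema = {
    "type": "object",
    "properties": {
        "housing_provider": {"$ref": "#/$defs/HousingProvider"},
    },
    "$defs": {
        "HousingProvider": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
        },
    },
}

resolved = resolve_refs(schema, schema)
```

After resolution, `resolved["properties"]["housing_provider"]` holds the full HousingProvider object definition rather than a pointer to it.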
