LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis

Inbum Heo; Taewook Hwang; Jeesu Jung; Sangkeun Jung

Abstract

Recent advances in Large Language Models (LLMs) and Large Multimodal Models (LMMs) have improved Document Layout Analysis (DLA), yet structural errors such as region merging, splitting, and omission remain persistent. Conventional overlap-based metrics (e.g., IoU, mAP) fail to capture such logical inconsistencies. To overcome this limitation, we propose Layout Error Detection (LED), a benchmark that evaluates structural reasoning in DLA predictions beyond surface-level accuracy. LED defines eight standardized error types (Missing, Hallucination, Size Error, Split, Merge, Overlap, Duplicate, and Misclassification) and provides quantitative rules and injection algorithms for realistic error simulation. Using these definitions, we construct LED-Dataset and design three evaluation tasks: document-level error detection, document-level error-type classification, and element-level error-type classification. Experiments with state-of-the-art multimodal models show that LED enables fine-grained and interpretable assessment of structural understanding, revealing clear weaknesses across modalities and architectures. Overall, LED establishes a unified and explainable benchmark for diagnosing the structural robustness and reasoning capability of document understanding models.
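As an illustration of why overlap scores alone can miss structural errors, the following minimal sketch injects three of the eight error types (Missing, Split, Merge) into a list of ground-truth boxes. It is not the LED injection algorithm: the `Box` type, the function names, the horizontal split at the vertical midpoint, and the choice to keep the first box's label on a merge are all assumptions made here for illustration, and the quantitative rules defined in the paper are not reproduced.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned layout region (hypothetical structure, not the LED schema)."""
    x0: float
    y0: float
    x1: float
    y1: float
    label: str


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    ih = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = iw * ih
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union if union > 0 else 0.0


def inject_missing(boxes: List[Box], index: int) -> List[Box]:
    """Missing: drop one region from the prediction."""
    return boxes[:index] + boxes[index + 1:]


def inject_split(boxes: List[Box], index: int) -> List[Box]:
    """Split: replace one region with its top and bottom halves (assumed rule)."""
    b = boxes[index]
    mid = (b.y0 + b.y1) / 2
    top = Box(b.x0, b.y0, b.x1, mid, b.label)
    bottom = Box(b.x0, mid, b.x1, b.y1, b.label)
    return boxes[:index] + [top, bottom] + boxes[index + 1:]


def inject_merge(boxes: List[Box], i: int, j: int) -> List[Box]:
    """Merge: replace two regions with their enclosing box (label of box i kept)."""
    a, b = boxes[i], boxes[j]
    merged = Box(min(a.x0, b.x0), min(a.y0, b.y0),
                 max(a.x1, b.x1), max(a.y1, b.y1), a.label)
    rest = [box for k, box in enumerate(boxes) if k not in (i, j)]
    return rest + [merged]


if __name__ == "__main__":
    gt = [Box(0, 0, 100, 40, "Text"), Box(0, 50, 100, 90, "Text")]
    split = inject_split(gt, 0)
    merged = inject_merge(gt, 0, 1)
    print(len(split), iou(gt[0], split[0]))
    print(len(merged), iou(gt[0], merged[0]))
```

In this toy case each half of the split box overlaps the original region with an IoU of 0.5, and the merged box overlaps each original region with an IoU of about 0.44. Neither number says whether the region was split, merged, or simply mis-sized, which is the kind of distinction the error-type labels are meant to make explicit.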

Paper Structure (24 sections, 4 figures, 2 tables)

Introduction
Related Work
Error in Document Layout Analysis
Document-Specific Error Definition and Injection
Comparison with Existing Error Definitions
LED-Dataset
Synthetic Dataset Generation
Raw Data and Annotation Structure
Dataset Statistics
LED benchmark
Task Definition
Prompting Configuration
Experimental Setup
Model Pool & Size
Implementation & API Setting
...and 9 more sections

Figures (4)

Figure 1: Example from the LED-Dataset showing predicted boxes (red) and ground truth boxes (green). Regions visible only in green indicate missing elements, while red boxes illustrate size discrepancies or split errors within the document layout.
Figure 2: Overview of the proposed Layout Error Detection (LED) framework. LED defines eight structural error types and builds the LED-Dataset by injecting realistic layout errors into DocLayNet pages. It evaluates models through three hierarchical tasks: ($T_1$) document-level error detection, ($T_2$) error type classification, and ($T_3$) element-level error classification, enabling fine-grained and explainable assessment of structural robustness in Document Layout Analysis.
Figure 3: Prompting-wise robustness across models (lower CV/NR = higher stability)
Figure 4: Detection rates for the two most frequent error types in LED ($T_2$ task). Red dashed lines indicate true error distributions.
