LLM-Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

TL;DR

This work tackles label-efficient document layout analysis by fusing visual detections with structural priors from text-pretrained LLMs. It introduces an adaptive inverse-variance fusion framework that combines teacher predictions with LLM inferences, enhanced by cross-modal consistency and a curriculum-driven training objective. The approach delivers strong results on PubLayNet and DocLayNet, achieving 88.2 AP with 5% labels on a lightweight SwiftFormer backbone and 89.7 AP with a LayoutLMv3 teacher, while providing theoretical PAC-style guarantees and practical privacy-preserving deployment options via open-source LLMs. Overall, the method demonstrates that LLM-derived structural knowledge is complementary to both lightweight and pretrained document models, enabling substantial gains in label efficiency and robust cross-domain performance.

Abstract

Document layout understanding remains data-intensive despite advances in semi-supervised learning. We present a framework that enhances semi-supervised detection by fusing visual predictions with structural priors from text-pretrained LLMs via principled probabilistic weighting. Given unlabeled documents, an OCR-LLM pipeline infers hierarchical regions, which are combined with teacher detector outputs through inverse-variance fusion to generate refined pseudo-labels. Our method demonstrates consistent gains across model scales. With a lightweight SwiftFormer backbone (26M params), we achieve 88.2$\pm$0.3 AP using only 5\% labels on PubLayNet. When applied to document-pretrained LayoutLMv3 (133M params), our fusion framework reaches 89.7$\pm$0.4 AP, surpassing LayoutLMv3 with standard semi-supervised learning (89.1$\pm$0.4 AP, p=0.02) and matching UDOP~\cite{udop} (89.8 AP), which requires 100M+ pages of multimodal pretraining. This demonstrates that LLM structural priors are complementary to both lightweight and pretrained architectures. Key findings include: (1) learned instance-adaptive gating improves over fixed weights by +0.9 AP, with data-dependent PAC bounds correctly predicting convergence; (2) open-source LLMs enable privacy-preserving deployment with minimal loss (Llama-3-70B: 87.1 AP lightweight, 89.4 AP with LayoutLMv3); (3) LLMs provide targeted semantic disambiguation (18.7\% of cases, +3.8 AP gain) beyond simple text heuristics. Total system cost includes \$12 for GPT-4o-mini API or 17 GPU-hours for local Llama-3-70B per 50K pages, amortized across training runs.

Paper Structure

This paper contains 46 sections, 3 theorems, 29 equations, 4 figures, 13 tables, 1 algorithm.

Key Result

Theorem 1

We apply classical optimal estimation results. Under assumptions A1 (unbiased predictors) and bounded error correlation $\rho \leq \rho_{\max} < 1$, the optimal linear fusion $\hat{y}_f = \alpha^\star\hat{y}_t + (1-\alpha^\star)\hat{y}_l$ with $\alpha^\star = \frac{\sigma_l^2 - \rho\sigma_t\sigma_l}
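The fusion weight in Theorem 1 is the classical minimum-variance combination of two unbiased, correlated estimators. The sketch below is illustrative only and is not taken from the paper's code. It assumes the textbook closed form $\alpha^\star = \frac{\sigma_l^2 - \rho\sigma_t\sigma_l}{\sigma_t^2 + \sigma_l^2 - 2\rho\sigma_t\sigma_l}$, whose numerator matches the statement above and whose denominator is the standard one, supplied here as an assumption. The function names (`optimal_fusion_weight`, `fused_variance`, `fuse`) are hypothetical.

```python
def optimal_fusion_weight(sigma_t: float, sigma_l: float, rho: float = 0.0) -> float:
    """Weight on the teacher estimate that minimizes the fused variance.

    Assumes both estimators are unbiased (A1) and that their errors have
    correlation rho with |rho| < 1. With rho = 0 this reduces to plain
    inverse-variance weighting.
    """
    if sigma_t <= 0 or sigma_l <= 0:
        raise ValueError("standard deviations must be positive")
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    # Textbook denominator (an assumption here): Var(e_t - e_l).
    denom = sigma_t**2 + sigma_l**2 - 2.0 * rho * sigma_t * sigma_l
    return (sigma_l**2 - rho * sigma_t * sigma_l) / denom


def fused_variance(sigma_t: float, sigma_l: float, rho: float, alpha: float) -> float:
    """Variance of alpha * y_t + (1 - alpha) * y_l for a given alpha."""
    return (
        alpha**2 * sigma_t**2
        + (1.0 - alpha) ** 2 * sigma_l**2
        + 2.0 * alpha * (1.0 - alpha) * rho * sigma_t * sigma_l
    )


def fuse(y_t: float, y_l: float, sigma_t: float, sigma_l: float, rho: float = 0.0):
    """Return the fused estimate and the weight placed on the teacher."""
    alpha = optimal_fusion_weight(sigma_t, sigma_l, rho)
    return alpha * y_t + (1.0 - alpha) * y_l, alpha
```

For example, with $\sigma_t = 1$, $\sigma_l = 2$ and $\rho = 0$, the teacher weight is 0.8 and the fused variance is 0.8, below the variance of either source (1 and 4). As $\rho$ grows, the less noisy source takes over: at $\rho = 0.5$ the same inputs give a weight of 1, so the LLM estimate contributes nothing.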

Figures (4)

  • Figure 1: LLM-guided semi-supervised framework. OCR text is sent to an LLM for structural inference, while the teacher detector generates visual predictions. The two sets of predictions are fused via IoU matching and confidence weighting, and the student trains on the refined pseudo-labels.
  • Figure 2: Qualitative examples: (a) Class correction, (b) Localization refinement, (c) Confidence boosting.
  • Figure 3: Label efficiency on PubLayNet. Near-supervised performance at 10% labels.
  • Figure 4: Adaptive gating vs OCR confidence. Learned gate down-weights text as quality degrades.
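
The "IoU matching and confidence weighting" step named in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the greedy one-to-one matching, the 0.5 IoU threshold, the confidence-proportional averaging of box coordinates, and the use of the larger confidence for the fused box are all hypothetical choices, and `iou` and `match_and_fuse` are hypothetical names. Boxes are `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_and_fuse(teacher, llm, iou_thresh=0.5):
    """Fuse teacher and LLM regions, each given as a list of (box, confidence).

    Each teacher box is greedily paired with the unused LLM box of highest
    IoU (if that IoU reaches iou_thresh). Matched boxes are averaged with
    confidence-proportional weights; unmatched teacher boxes pass through.
    """
    fused, used = [], set()
    for t_box, t_conf in teacher:
        best_j, best_iou = None, iou_thresh
        for j, (l_box, _) in enumerate(llm):
            if j in used:
                continue
            overlap = iou(t_box, l_box)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            fused.append((t_box, t_conf))  # no LLM support: keep teacher box
            continue
        used.add(best_j)
        l_box, l_conf = llm[best_j]
        total = t_conf + l_conf
        w = t_conf / total if total > 0 else 0.5  # weight on the teacher box
        box = tuple(w * tc + (1.0 - w) * lc for tc, lc in zip(t_box, l_box))
        # Assumption: keep the larger of the two confidences for the fused box.
        fused.append((box, max(t_conf, l_conf)))
    return fused
```

A variance-based weight, as in Theorem 1, could replace the confidence-proportional weight `w` if per-source uncertainty estimates are available.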

Theorems & Definitions (8)

  • Theorem 1: Standard Inverse-Variance Weighting
  • Theorem 2: Data-Dependent Generalization Bound
  • Proof Sketch: Structural Rademacher Complexity
  • Corollary 1: Regime Boundary Analysis
  • Proof Sketch
  • Proof
  • Proof
  • Proof