SynJAC: Synthetic-data-driven Joint-granular Adaptation and Calibration for Domain Specific Scanned Document Key Information Extraction

Yihao Ding, Soyeon Caren Han, Zechuan Li, Hyunsuk Chung

TL;DR

SynJAC tackles domain-specific key information extraction (KIE) from visually rich scanned documents by replacing large-scale manual annotation with synthetic data, while preserving accuracy through a joint-granular architecture that fuses fine- and coarse-grained representations via a Layout-to-Vector (L2V) embedding. It introduces three domain adaptation strategies, Structural Domain Shifting (SDS), Synthetic Sequence Tagging (SST), and Synthetic Instruction Tuning (SIT), together with a guidance-based calibration stage that aligns synthetic knowledge with a small annotated set. Empirical results on FormNLU and CORD show substantial gains under few-shot and zero-shot conditions, with robust performance even when synthetic labels are noisy. The work shows measurable improvements from L2V, multi-stage feature fusion, and calibration, offering a scalable path for KIE on visually rich documents (VRDs) in real-world, low-label regimes, while acknowledging the remaining challenges of synthetic noise and distribution shift.

Abstract

Visually Rich Documents (VRDs), comprising elements such as charts, tables, and paragraphs, convey complex information across diverse domains. However, extracting key information from these documents remains labour-intensive, particularly for scanned formats with inconsistent layouts and domain-specific requirements. Despite advances in pretrained models for VRD understanding, their dependence on large annotated datasets for fine-tuning hinders scalability. This paper proposes SynJAC (Synthetic-data-driven Joint-granular Adaptation and Calibration), a method for key information extraction in scanned documents. SynJAC leverages synthetic, machine-generated data for domain adaptation and employs calibration on a small, manually annotated dataset to mitigate noise. By integrating fine-grained and coarse-grained document representation learning, SynJAC significantly reduces the need for extensive manual labelling while achieving competitive performance. Extensive experiments demonstrate its effectiveness in domain-specific and scanned VRD scenarios.

Paper Structure (54 sections, 9 equations, 11 figures, 14 tables)

Figures (11)

  • Figure 1: Comparing manual and synthetic structural and task-oriented annotation.
  • Figure 2: Workflow for generating synthetic annotations for domain-specific understanding.
  • Figure 3: The SynJAC framework is built on a Joint-grained Model containing both fine-grained and coarse-grained document representations (left). We introduce three domain adaptation strategies, SDS, SST, and SIT, to enable the joint-grained framework to effectively adapt to the target domain from both structural and task-oriented perspectives.
  • Figure 4: Off-the-shelf-tool analysis. Synthetic-Structure (Syn-Struct) and Synthetic-Text (Syn-Text).
  • Figure 5: Performance of SynJAC with stepped training set ratios on three test sets.
  • ...and 6 more figures