SynJAC: Synthetic-data-driven Joint-granular Adaptation and Calibration for Domain Specific Scanned Document Key Information Extraction
Yihao Ding, Soyeon Caren Han, Zechuan Li, Hyunsuk Chung
TL;DR
SynJAC tackles domain-specific key information extraction (KIE) from visually rich scanned documents by replacing large-scale manual annotation with synthetic data, while preserving accuracy through a joint-granular architecture that fuses fine- and coarse-grained representations via a Layout-to-Vector (L2V) embedding. It introduces three domain adaptation strategies, Structural Domain Shifting (SDS), Synthetic Sequence Tagging (SST), and Synthetic Instruction Tuning (SIT), plus a guidance-based calibration stage that aligns synthetic knowledge with a small annotated set. Experiments on FormNLU and CORD show substantial gains under few-shot and zero-shot conditions, with robust performance even when synthetic labels are noisy. The work shows measurable improvements from L2V, multi-stage feature fusion, and calibration, offering a scalable path for KIE on visually rich documents in real-world, low-label regimes, while acknowledging the remaining challenges of synthetic noise and distribution shifts.
Abstract
Visually Rich Documents (VRDs), comprising elements such as charts, tables, and paragraphs, convey complex information across diverse domains. However, extracting key information from these documents remains labour-intensive, particularly for scanned formats with inconsistent layouts and domain-specific requirements. Despite advances in pretrained models for VRD understanding, their dependence on large annotated datasets for fine-tuning hinders scalability. This paper proposes SynJAC (Synthetic-data-driven Joint-granular Adaptation and Calibration), a method for key information extraction in scanned documents. SynJAC leverages synthetic, machine-generated data for domain adaptation and employs calibration on a small, manually annotated dataset to mitigate noise. By integrating fine-grained and coarse-grained document representation learning, SynJAC significantly reduces the need for extensive manual labelling while achieving competitive performance. Extensive experiments demonstrate its effectiveness in domain-specific and scanned VRD scenarios.
