DocDjinn: Controllable Synthetic Document Generation with VLMs and Handwriting Diffusion

Marcel Lamott, Saifullah Saifullah, Nauman Riaz, Yves-Noel Weweler, Tobias Alt-Veit, Ahmad Sarmad Ali, Muhammad Armaghan Shakir, Adrian Kalwa, Momina Moetesum, Andreas Dengel, Sheraz Ahmed, Faisal Shafait, Ulrich Schwanecke, Adrian Ulges

TL;DR

This is the first work demonstrating that VLMs can generate faithful annotated document datasets at scale from unlabeled seeds that can effectively enrich or approximate real, manually annotated data for diverse document understanding tasks.

Abstract

Effective document intelligence models rely on large amounts of annotated training data. However, procuring sufficient and high-quality data poses significant challenges due to the labor-intensive and costly nature of data acquisition. Additionally, leveraging language models to annotate real documents raises concerns about data privacy. Synthetic document generation has emerged as a promising, privacy-preserving alternative. We propose DocDjinn, a novel framework for controllable synthetic document generation using Vision-Language Models (VLMs) that produces annotated documents from unlabeled seed samples. Our approach generates visually plausible and semantically consistent synthetic documents that follow the distribution of an existing source dataset through clustering-based seed selection with parametrized sampling. By enriching documents with realistic diffusion-based handwriting and contextual visual elements via semantic-visual decoupling, we generate diverse, high-quality annotated synthetic documents. We evaluate across eleven benchmarks spanning key information extraction, question answering, document classification, and document layout analysis. To our knowledge, this is the first work demonstrating that VLMs can generate faithful annotated document datasets at scale from unlabeled seeds that can effectively enrich or approximate real, manually annotated data for diverse document understanding tasks. We show that with only 100 real training samples, our framework achieves on average $87\%$ of the performance of the full real-world dataset. We publicly release our code and 140k+ synthetic document samples.
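The clustering-based seed selection mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_seeds`, the exponent `alpha`, and the rule "pick a cluster with probability proportional to size raised to `alpha`, then pick a document uniformly inside it" are assumptions made here for illustration. The sketch also assumes cluster labels have already been computed and that noise points carry the label `-1`, as HDBSCAN implementations report them.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional


def select_seeds(
    cluster_labels: List[int],
    n_seeds: int,
    alpha: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Sample document indices to use as seeds (hypothetical helper).

    cluster_labels[i] is the cluster id of document i; -1 marks noise.
    alpha is an assumed sampling parameter:
      alpha = 1.0 follows the cluster-size distribution of the source data,
      alpha = 0.0 treats every cluster as equally likely.
    """
    rng = rng or random.Random(0)

    # Group document indices by cluster, skipping noise points.
    clusters: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        if label != -1:
            clusters[label].append(idx)
    if not clusters:
        return []

    # Cluster weights grow with cluster size, tempered by alpha.
    ids = sorted(clusters)
    weights = [len(clusters[c]) ** alpha for c in ids]

    # Draw a cluster per seed, then a document uniformly within that cluster.
    chosen = rng.choices(ids, weights=weights, k=n_seeds)
    return [rng.choice(clusters[c]) for c in chosen]


if __name__ == "__main__":
    labels = [0, 0, 0, 1, -1]  # three docs in cluster 0, one in cluster 1, one noise
    print(select_seeds(labels, n_seeds=5))
```

Sampling with replacement keeps the sketch short; a real pipeline might instead cap how often a single seed is reused.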

Paper Structure (48 sections, 2 equations, 33 figures, 11 tables, 1 algorithm)

Figures (33)

  • Figure 1: Examples of synthetically generated documents across diverse domains and tasks. Our framework produces documents with realistic layouts, VLM-generated content, diffusion-based handwriting, and contextual visual elements.
  • Figure 2: Overview of DocDjinn for synthetic document generation. After selecting representative seeds from a source dataset, a VLM generates an HTML representation of the document, along with multi-task ground truth information. This representation is enhanced with diffusion-generated handwriting and further visual elements. Finally, the ground truth is updated with bounding boxes and verified.
  • Figure 3: Baseline alignment for sentence-level handwritten text. Top, left to right: input image, word segmentation, lowest-ink pixel per column (in red), and computed baseline via percentile (in blue). Bottom: example sentence-level handwritten text after baseline alignment (red).
  • Figure 4: Overview of the clusterings used for all datasets. Each clustering lists the embedding type, the HDBSCAN [CMS13] minimum cluster size $\kappa$, and the number of resulting clusters.
  • Figure 5: Clustering results across different embeddings and HDBSCAN [CMS13] minimum cluster sizes $\kappa$ for DocVQA [MKJ21].
  • ...and 28 more figures
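
The baseline estimation described in the Figure 3 caption (lowest-ink pixel per column, then a percentile over those values) can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's code: the input is assumed to be a binary mask given as a list of rows with `1` for ink, the function names are hypothetical, the percentile uses the nearest-rank definition, and the default `q=50` is a placeholder because the caption does not state the value used.

```python
import math
from typing import List, Optional


def lowest_ink_per_column(mask: List[List[int]]) -> List[Optional[int]]:
    """Return, for each column, the row index of its lowest ink pixel.

    Rows are indexed top to bottom, so "lowest" means the largest row index.
    Columns without ink yield None.
    """
    if not mask:
        return []
    height, width = len(mask), len(mask[0])
    lowest: List[Optional[int]] = []
    for x in range(width):
        row = None
        for y in range(height - 1, -1, -1):  # scan upward from the bottom
            if mask[y][x]:
                row = y
                break
        lowest.append(row)
    return lowest


def percentile_baseline(lowest: List[Optional[int]], q: float = 50.0) -> Optional[int]:
    """Estimate the baseline row as the q-th nearest-rank percentile.

    q is an assumed parameter. A moderate percentile ignores the few columns
    whose ink reaches further down, such as descenders.
    """
    rows = sorted(r for r in lowest if r is not None)
    if not rows:
        return None
    rank = max(1, math.ceil(q / 100.0 * len(rows)))
    return rows[rank - 1]


def vertical_offset(baseline: int, target_row: int) -> int:
    """Rows to shift a word so that its baseline lands on target_row."""
    return target_row - baseline


# Toy word mask: four columns end at row 3, one descender column reaches row 4.
MASK = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0],
]

if __name__ == "__main__":
    lows = lowest_ink_per_column(MASK)
    base = percentile_baseline(lows)
    print(lows, base, vertical_offset(base, target_row=10))
```

In the toy mask the descender column does not pull the estimate down, which is the motivation for taking a percentile instead of the maximum.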