SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding

Chuanghao Ding, Xuejing Liu, Wei Tang, Juan Li, Xiaoliang Wang, Rui Zhao, Cam-Tu Nguyen, Fei Tan

TL;DR

SynthDoc tackles data scarcity in Visual Document Understanding (VDU) by generating a scalable, bilingual synthetic dataset that combines text, images, tables, and charts. The pipeline has two stages: layout design, handled by Page, Region, and Line controllers, and content rendering, which produces the graphics and the text. The resulting data supports end-to-end pretraining of models such as Donut, which pairs a Swin Transformer encoder with an mBART decoder. A benchmark of 5,000 image-text pairs demonstrates robust bilingual parsing and strong downstream performance, supporting the use of synthetic data for multilingual VDU. The work offers a practical, language-agnostic data-generation solution and substantiates the potential of end-to-end document parsing on complex, real-world documents.
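
The three-scale layout design described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `plan_page`, `plan_regions`, and `plan_lines`, the grid-based split, the page size, and the line metrics are all assumptions chosen only to show how page-level, region-level, and line-level decisions nest.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Line:
    x: int
    y: int
    width: int
    height: int


@dataclass
class Region:
    kind: str  # "text", "image", "table", or "chart"
    x: int
    y: int
    width: int
    height: int
    lines: list = field(default_factory=list)


def plan_lines(region, line_height=24, spacing=8):
    """Line scale (hypothetical): stack evenly spaced lines inside a region."""
    lines = []
    y = region.y
    while y + line_height <= region.y + region.height:
        lines.append(Line(region.x, y, region.width, line_height))
        y += line_height + spacing
    return lines


def plan_regions(page_w, page_h, margin, columns, rows, rng):
    """Region scale (hypothetical): cut the content area into a grid of typed cells."""
    col_w = (page_w - 2 * margin) // columns
    row_h = (page_h - 2 * margin) // rows
    regions = []
    for c in range(columns):
        for r in range(rows):
            # Text is listed twice so it is sampled more often than each graphic type.
            kind = rng.choice(["text", "text", "image", "table", "chart"])
            regions.append(
                Region(kind, margin + c * col_w, margin + r * row_h, col_w, row_h)
            )
    return regions


def plan_page(page_w=1240, page_h=1754, margin=80, seed=0):
    """Page scale (hypothetical): pick a column count, then plan regions and lines."""
    rng = random.Random(seed)
    columns = rng.choice([1, 2])
    regions = plan_regions(page_w, page_h, margin, columns, rows=3, rng=rng)
    for region in regions:
        if region.kind == "text":
            region.lines = plan_lines(region)
    return regions
```

A content renderer would then fill each planned box: text lines for `"text"` regions and a rendered graphic for the other kinds. Keeping the layout plan separate from rendering means the same plan can yield both the page image and its ground-truth annotation.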

Abstract

This paper introduces SynthDoc, a novel synthetic document generation pipeline designed to enhance Visual Document Understanding (VDU) by generating high-quality, diverse datasets that include text, images, tables, and charts. Addressing the challenges of data acquisition and the limitations of existing datasets, SynthDoc leverages publicly available corpora and advanced rendering tools to create a comprehensive and versatile dataset. Our experiments, conducted using the Donut model, demonstrate that models trained with SynthDoc's data achieve superior performance in pre-training read tasks and maintain robustness in downstream tasks, despite language inconsistencies. The release of a benchmark dataset comprising 5,000 image-text pairs not only showcases the pipeline's capabilities but also provides a valuable resource for the VDU community to advance research and development in document image recognition. This work significantly contributes to the field by offering a scalable solution to data scarcity and by validating the efficacy of end-to-end models in parsing complex, real-world documents.

Paper Structure (27 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: The pipeline of Document Image Synthesis, including layout design and content rendering. The layout design involves planning at three scales: full-page, regional, and line-by-line. Content rendering creates both visual graphics and textual content.
  • Figure 2: Gridlined and gridless table renderings.
  • Figure 3: (a) Samples of the synthesized charts: Pie Chart, Vertical Bar Chart, Scatter Chart and Line Chart. (b) The annotation formats corresponding to different charts, which are presented in HTML format.
  • Figure 4: An overview of the architecture used to train the model.
  • Figure 5: Examples of document image parsing on synthesized documents with tables, images, and charts. (a), (b), and (c) show the synthetic document images with tables, images, and charts; (d), (e), and (f) show the model's parsing results on them, respectively.
  • ...and 1 more figures
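
The Figure 3 caption notes that chart annotations are presented in HTML format. The exact schema is not given here, so the following is only a sketch under an assumed layout: the hypothetical helper `chart_to_html` serializes a chart's underlying data as an HTML table with a caption, which is one plausible way to produce a text target for a rendered chart.

```python
from html import escape


def chart_to_html(title, x_label, y_label, points):
    """Serialize chart data as an HTML table (assumed schema, not the paper's)."""
    rows = [f"<tr><th>{escape(x_label)}</th><th>{escape(y_label)}</th></tr>"]
    for x, y in points:
        rows.append(f"<tr><td>{escape(str(x))}</td><td>{escape(str(y))}</td></tr>")
    return f"<table><caption>{escape(title)}</caption>" + "".join(rows) + "</table>"


annotation = chart_to_html("Sales", "Year", "Units", [("2020", 5), ("2021", 8)])
```

Because the annotation is derived from the same data that drives the chart renderer, the image and its label stay consistent by construction.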