CREPE: Coordinate-Aware End-to-End Document Parser

Yamato Okamoto, Youngmin Baek, Geewook Kim, Ryota Nakao, DongHyun Kim, Moon Bin Yim, Seunghyun Park, Bado Lee

TL;DR

CREPE tackles visual document understanding without relying on OCR by jointly producing parsing outputs and text coordinates from document images. It introduces a coordinate head within a multi-head Transformer decoder; coordinates are decoded only when the special token </ocr> is generated and are normalized to the input image size. The architecture supports both bounding boxes ($x_{\min}$, $y_{\min}$, $x_{\max}$, $y_{\max}$) and quadrilaterals ($x_1, y_1$ through $x_4, y_4$). A weakly supervised learning framework enables training with parsing annotations only: it mixes synthetic OCR tasks with real parsing data, applies selective loss masking, and combines standard sequence and coordinate losses with Distance Box IoU losses for better localization. Empirically, CREPE achieves state-of-the-art parsing on CORD, POIE, and FUNSD within the same modality, demonstrates credible text-coordinate extraction (evaluated with CLEval), and adapts to document layout analysis, DocVQA, and scene understanding tasks. This reduces dependence on external OCR and enables broader end-to-end document understanding.
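The token-triggered decoding described above can be illustrated with a small post-processing sketch. It is not the authors' implementation. It assumes the decoder yields one token per step plus a normalized box from the coordinate head, and that only the box at a `</ocr>` step is meaningful. The opening marker `<ocr>`, the function name `collect_ocr_segments`, and the whitespace joining of tokens are hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max), normalized to the input image size
Box = Tuple[float, float, float, float]

OCR_OPEN = "<ocr>"    # hypothetical opening marker (assumption, not from the source)
OCR_CLOSE = "</ocr>"  # special token named in the summary


def collect_ocr_segments(
    tokens: List[str],
    coords: List[Optional[Box]],
    image_size: Tuple[int, int],
) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Pair each text segment with the box emitted at its closing </ocr> step."""
    width, height = image_size
    segments = []
    current = None  # None means we are outside an OCR segment
    for token, box in zip(tokens, coords):
        if token == OCR_OPEN:
            current = []
        elif token == OCR_CLOSE and current is not None and box is not None:
            x0, y0, x1, y1 = box
            # Undo the normalization to recover pixel coordinates.
            pixel_box = (
                round(x0 * width), round(y0 * height),
                round(x1 * width), round(y1 * height),
            )
            segments.append((" ".join(current), pixel_box))
            current = None
        elif current is not None:
            current.append(token)
        # Tokens outside an OCR segment (e.g. structure tags) carry no box.
    return segments


tokens = ["<menu>", "<ocr>", "Iced", "Latte", "</ocr>", "</menu>"]
coords = [None, None, None, None, (0.25, 0.5, 0.75, 0.625), None]
print(collect_ocr_segments(tokens, coords, image_size=(800, 400)))
# prints [('Iced Latte', (200, 200, 600, 250))]
```

Reading the box only at the closing token keeps the text sequence and the coordinates aligned without adding coordinate tokens to the vocabulary, which is one plausible reading of the design summarized here.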

Abstract

In this study, we formulate an OCR-free sequence generation model for visual document understanding (VDU). Our model not only parses text from document images but also extracts the spatial coordinates of the text based on the multi-head architecture. Named Coordinate-aware End-to-end Document Parser (CREPE), our method uniquely integrates these capabilities by introducing a special token for OCR text and token-triggered coordinate decoding. We also propose a weakly-supervised framework for cost-efficient training, requiring only parsing annotations without high-cost coordinate annotations. Our experimental evaluations demonstrate CREPE's state-of-the-art performance on document parsing tasks. Beyond that, CREPE's adaptability is further highlighted by its successful usage in other document understanding tasks such as layout analysis, document visual question answering, and so on. CREPE's abilities, including OCR and semantic parsing, not only mitigate error propagation issues in existing OCR-dependent methods but also significantly enhance the functionality of sequence generation models, ushering in a new era for document understanding studies.
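To make the weakly-supervised idea concrete, the sketch below shows one way a coordinate loss could be masked so that samples with parsing annotations only contribute nothing to it. This is an assumption about what selective loss masking means in practice, not the paper's loss. The function name `masked_coordinate_l1` is hypothetical, and plain L1 stands in for the coordinate loss.

```python
from typing import List, Optional, Sequence


def masked_coordinate_l1(
    pred: List[Sequence[float]],
    target: List[Optional[Sequence[float]]],
) -> float:
    """Mean L1 error over positions that have a coordinate label.

    A target of None marks a position without coordinate annotation
    (e.g. real parsing data); it is skipped rather than penalized.
    """
    total = 0.0
    count = 0
    for p, t in zip(pred, target):
        if t is None:
            continue  # masked out: no coordinate supervision here
        total += sum(abs(a - b) for a, b in zip(p, t))
        count += len(t)
    # With no labeled positions the coordinate loss is defined as zero.
    return total / count if count else 0.0


pred = [(0.0, 0.0, 1.0, 1.0), (0.5, 0.5, 0.5, 0.5)]
target = [None, (0.25, 0.5, 0.75, 0.5)]
print(masked_coordinate_l1(pred, target))  # prints 0.125
```

Under this reading, synthetic OCR samples supply the coordinate supervision while real parsing samples train only the sequence output.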

Paper Structure (21 sections, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Examples of Parsing Task Outputs. OCR tasks aim to gather all text strings along with their corresponding coordinates in the document image, while parsing tasks aim to extract semantic information and present it in the required structured format.
  • Figure 2: Differences among Traditional Approaches and Ours. (a) OCR-based methods leverage text strings and coordinates derived from an external OCR module. (b) End-to-end approaches that include OCR modules enable comprehensive optimization. (c) End-to-end approaches using sequence generation decode the parsed texts without coordinates. (d) Our approach, termed CREPE, incorporates a multi-head architecture that preserves the advantages of an OCR-free method while uniquely enabling the provision of text coordinates.
  • Figure 3: Overview of CREPE Architecture. CREPE employs a multi-head architecture designed to generate parsing outputs and corresponding text coordinates concurrently. It uses the special token </ocr> to indicate that a text segment is associated with coordinates produced by a different head.
  • Figure 4: Sample Output from CREPE. The converted output comprises both parsing results and associated text coordinates. Corresponding outputs are highlighted using matching colors. The alignment between the text and coordinates is facilitated through the use of the special token </ocr>.
  • Figure 5: OCR Task Result of the Pretrained Model. The text sequence and the corresponding bounding boxes were obtained from the sequence head and the coordinate head, respectively.
  • ...and 6 more figures