DoPTA: Improving Document Layout Analysis using Patch-Text Alignment

Nikitha SR, Tarun Ram Menta, Mausoom Sarkar

TL;DR

DoPTA tackles visual document understanding by grounding image representations in the textual content of document images without relying on OCR at inference. It introduces a novel IoU-guided patch-text alignment loss that aligns text tokens to image patches, together with an MAE-like image reconstruction loss, to produce a robust DoPTA encoder. Across document image classification, layout analysis, and text detection, DoPTA achieves state-of-the-art results with fewer parameters and substantially less pre-training than prior OCR-based or multimodal methods, while enabling OCR-free inference and faster deployment. The approach demonstrates the practical impact of leveraging text within images for fine-grained visual understanding of dense documents and suggests directions for further efficiency and accuracy gains.

Abstract

The advent of multimodal learning has brought a significant improvement in document AI. Documents are now treated as multimodal entities, incorporating both textual and visual information for downstream analysis. However, works in this space often focus on the textual aspect, using the visual space as auxiliary information. While some works have explored purely vision-based techniques for document image understanding, they either require OCR-identified text as input during inference or do not align with text in their learning procedure. Therefore, we present a novel image-text alignment technique specially designed for leveraging the textual information in document images to improve performance on visual tasks. Our document encoder model, DoPTA, trained with this technique, demonstrates strong performance on a wide range of document image understanding tasks without requiring OCR during inference. Combined with an auxiliary reconstruction objective, DoPTA consistently outperforms larger models while using significantly less pre-training compute. DoPTA also sets new state-of-the-art results on D4LA and FUNSD, two challenging document visual analysis benchmarks.
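
The TL;DR names an IoU-guided patch-text alignment loss, but this summary does not spell out its form. As a rough illustration only, the sketch below computes the IoU between a word bounding box and each cell of a square patch grid, which is one plausible ingredient of such a loss. The function names, the (x0, y0, x1, y1) pixel box format, and the square-grid assumption are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixels, with x1 > x0 and y1 > y0.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the intersection rectangle (may be empty).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def patch_iou_grid(word_box: Box, image_size: int, patch_size: int) -> List[List[float]]:
    """IoU of one word box with every patch of a square patch grid.

    Assumes a square image split into non-overlapping square patches,
    indexed as grid[row][col]. This layout is an assumption for illustration.
    """
    n = image_size // patch_size
    grid = []
    for r in range(n):
        row = []
        for c in range(n):
            patch = (
                float(c * patch_size),
                float(r * patch_size),
                float((c + 1) * patch_size),
                float((r + 1) * patch_size),
            )
            row.append(box_iou(word_box, patch))
        grid.append(row)
    return grid
```

For a 32-pixel image with 16-pixel patches, a word box at (8, 0, 24, 16) overlaps the two top patches equally (IoU 1/3 each) and misses the bottom row entirely. How such overlaps would weight or select patch-text pairs in the actual loss is left open here.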

Paper Structure

This paper contains 20 sections, 4 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Our method achieves superior FPS due to the OCR-free inference setting while also setting SOTA mAP compared to several existing methods. Model(OCR) denotes the FPS when OCR parsing is taken into account for computing inference time.
  • Figure 2: Pre-training of DoPTA. Only the image encoder is required for downstream usage. Refer to the method section for details.
  • Figure 3: Heatmap visualisation of the normalised dot product similarity of image region embeddings with the text embedding for the token 'phosphine', taken from the DoPTA model. Additional qualitative results are presented in the appendix.
  • Figure 4: Results of DoPTA and existing SOTA document encoder models. DoPTA outperforms other methods on multiple benchmarks, despite having fewer parameters and a significantly shorter pre-training schedule. Refer to the experiments section for more details on individual benchmarks.
  • Figure 5: Failure case of DoPTA on layout analysis on the D4LA benchmark. Left is DoPTA. Right is VGT. DoPTA incorrectly marks the central region as RegionKV, which was found to be a common error mode.
  • ...and 4 more figures