A Hybrid Approach for Document Layout Analysis in Document images

Tahira Shehzadi, Didier Stricker, Muhammad Zeshan Afzal

TL;DR

This work targets robust document layout analysis by integrating a Transformer-based detector (DINO) with a novel query encoding pipeline and a hybrid training scheme that blends one-to-one and one-to-many matching. By extracting multi-scale features with a ResNet-50 backbone, enhancing object queries through backbone-derived high-level features, and combining decoder queries with refined queries, the approach improves detection of small graphical elements such as headers and captions. The paper demonstrates state-of-the-art performance on PubLayNet, DocLayNet, and PubTables, with notable gains in challenging element categories and strong ablations validating the contributions. The resulting framework offers improved accuracy and efficiency for converting document images into editable, parsable formats, enhancing information retrieval and data extraction workflows.

Abstract

Document layout analysis involves understanding the arrangement of elements within a document. This paper navigates the complexities of understanding various elements within document images, such as text, images, tables, and headings. The approach employs an advanced Transformer-based object detection network as an innovative graphical page object detector for identifying tables, figures, and displayed elements. We introduce a query encoding mechanism to provide high-quality object queries for contrastive learning, enhancing efficiency in the decoder phase. We also present a hybrid matching scheme that integrates the decoder's original one-to-one matching strategy with the one-to-many matching strategy during the training phase. This approach aims to improve the model's accuracy and versatility in detecting various graphical elements on a page. Our experiments on PubLayNet, DocLayNet, and PubTables benchmarks show that our approach outperforms current state-of-the-art methods. It achieves an average precision of 97.3% on PubLayNet, 81.6% on DocLayNet, and 98.6% on PubTables, demonstrating its superior performance in layout analysis. These advancements not only enhance the conversion of document images into editable and accessible formats but also streamline information retrieval and data extraction processes.
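To make the hybrid matching idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a precomputed cost matrix `cost[q][g]` between object queries and ground-truth elements, and it assumes the one-to-many branch is formed by repeating each ground truth `k` times before matching, which is one common way to build such a branch. The function names, the value of `k`, and the toy costs are all hypothetical, and the brute-force search stands in for the Hungarian algorithm only because the example is tiny.

```python
from itertools import permutations


def one_to_one_match(cost):
    """Return {gt_index: query_index} with minimum total cost.

    cost[q][g] is the cost of assigning query q to ground truth g.
    Each query is used at most once. Brute force is used here for
    clarity; a real detector would use the Hungarian algorithm.
    Assumes there are at least as many queries as ground truths.
    """
    num_queries = len(cost)
    num_gt = len(cost[0]) if cost else 0
    best_total, best_assign = float("inf"), ()
    for queries in permutations(range(num_queries), num_gt):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best_total, best_assign = total, queries
    return {g: q for g, q in enumerate(best_assign)}


def one_to_many_match(cost, k):
    """Return {gt_index: [query_indices]} with k queries per ground truth.

    Assumption: each ground truth is replicated k times and the
    replicated problem is solved with one-to-one matching.
    """
    num_gt = len(cost[0]) if cost else 0
    # Column g*k + j holds the j-th copy of ground truth g.
    repeated = [[row[g] for g in range(num_gt) for _ in range(k)] for row in cost]
    flat = one_to_one_match(repeated)
    groups = {g: [] for g in range(num_gt)}
    for col, q in flat.items():
        groups[col // k].append(q)
    return {g: sorted(qs) for g, qs in groups.items()}


# Toy example: 4 queries, 2 ground-truth elements (hypothetical costs).
example_cost = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.9, 0.1],
    [0.7, 0.3],
]
print(one_to_one_match(example_cost))      # {0: 0, 1: 2}
print(one_to_many_match(example_cost, 2))  # {0: [0, 1], 1: [2, 3]}
```

In this sketch the one-to-one branch gives each ground-truth element a single positive query, while the one-to-many branch supplies extra positive queries per element as additional supervision during training.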

Paper Structure (14 sections, 8 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Diverse layouts and element types in the DocLayNet dataset, including captions, footnotes, formulas, and more. The figure underscores challenges in document layout analysis, such as interpreting dense text and categorizing diverse elements.
  • Figure 2: Overview of our approach for Document Layout Analysis. The input image is processed through a CNN backbone to extract features, which are then passed to a Transformer encoder-decoder network. The encoder processes the features globally, while the decoder uses object queries to interact with the encoded features and predict bounding boxes and classes for each object in the image. Our approach incorporates an enhanced query encoding mechanism to improve decoder efficiency and a query selection scheme that combines one-to-one and one-to-many matching strategies, improving accuracy and adaptability in identifying various graphical elements across documents.
  • Figure 3: Visual analysis of our approach on the DocLayNet dataset. Blue represents the ground truth and red denotes the predictions of our approach. The figure illustrates the model's proficiency in identifying small layout elements, specifically its accuracy in detecting page titles, headers, and footers.