Towards End-to-End Semi-Supervised Table Detection with Semantic Aligned Matching Transformer

Tahira Shehzadi, Shalini Sarode, Didier Stricker, Muhammad Zeshan Afzal

TL;DR

This work tackles table detection in document images under limited supervision by introducing a semi-supervised SAM-DETR framework that eliminates anchor proposals and post-processing steps such as NMS. It integrates a Semantics Aligner within a teacher–student DETR-based architecture to achieve semantic-aligned matching between object queries and encoded features, with RoIAlign-based query resampling and salient-point guidance. A Top-K pseudo-label filtering strategy, combined with an EMA teacher–student training regime, enables effective learning from unlabeled data; the training loss is $L = L_s + \alpha L_u$, with $\alpha = 0.25$. The approach is evaluated on TableBank, PubLayNet, PubTables, and ICDAR-19, and achieves high mAP even with small fractions of labeled data. The results show improved performance and recall over prior semi-supervised and transformer-based methods, highlighting the method’s robustness to diverse table structures and its practical potential in scalable document analysis.
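The two training ingredients named in the TL;DR, Top-K pseudo-label filtering and the weighted loss $L = L_s + \alpha L_u$, can be illustrated with a minimal sketch. It reads $L_s$ and $L_u$ as the supervised and unsupervised loss terms, which is an assumption based on the usual semi-supervised notation. The function names, the box layout, and the value of `k` are hypothetical and do not come from the paper. Plain Python floats stand in for tensors.

```python
from typing import List, Tuple

# Assumed box layout for illustration: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def top_k_pseudo_labels(boxes: List[Box], scores: List[float], k: int) -> List[Box]:
    """Keep the k teacher predictions with the highest confidence scores."""
    # Sort by score only, highest first; ties keep their original order.
    ranked = sorted(zip(scores, boxes), key=lambda pair: pair[0], reverse=True)
    return [box for _, box in ranked[:k]]


def total_loss(supervised: float, unsupervised: float, alpha: float = 0.25) -> float:
    """Combine the two loss terms as L = L_s + alpha * L_u."""
    return supervised + alpha * unsupervised


# Hypothetical teacher outputs on one unlabeled page.
teacher_boxes = [(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 3.0, 3.0)]
teacher_scores = [0.2, 0.9, 0.5]
pseudo_labels = top_k_pseudo_labels(teacher_boxes, teacher_scores, k=2)
loss = total_loss(supervised=1.0, unsupervised=2.0)
```

Ranking by score and keeping a fixed number of boxes avoids choosing a confidence threshold, which is the usual motivation for Top-K selection. The source does not state the value of K.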

Abstract

Table detection within document images is a crucial task in document processing, involving the identification and localization of tables. Recent strides in deep learning have substantially improved the accuracy of this task, but it still heavily relies on large labeled datasets for effective training. Several semi-supervised approaches have emerged to overcome this challenge, often employing CNN-based detectors with anchor proposals and post-processing techniques like non-maximal suppression (NMS). However, recent advancements in the field have shifted the focus towards transformer-based techniques, eliminating the need for NMS and emphasizing object queries and attention mechanisms. Previous research has focused on two key areas to improve transformer-based detectors: refining the quality of object queries and optimizing attention mechanisms. However, increasing object queries can introduce redundancy, while adjustments to the attention mechanism can increase complexity. To address these challenges, we introduce a semi-supervised approach employing SAM-DETR, a novel approach for precise alignment between object queries and target features. Our approach demonstrates remarkable reductions in false positives and substantial enhancements in table detection performance, particularly in complex documents characterized by diverse table structures. This work provides more efficient and accurate table detection in semi-supervised settings.

Paper Structure (19 sections, 16 equations, 3 figures, 13 tables)


Figures (3)

  • Figure 1: Overview of SAM-DETR [sm-detr34]. (a) The architecture of a single decoder layer in SAM-DETR, showing the role of learnable reference boxes in generating position embeddings for each object query. (b) The pipeline of the Semantics Aligner: reference boxes are used to extract features via RoIAlign, salient points are predicted within the targeted region, and new, semantically aligned query embeddings are generated and then refined with attributes from the previous queries. Image from [sm-detr34].
  • Figure 2: Illustration of our semi-supervised table detection framework. This dual-component system involves a Student module that learns from a mix of labeled data and strongly augmented unlabeled images, and a Teacher module that refines its understanding using weakly augmented unlabeled images. During training, the Student module updates the Teacher module through an exponential moving average (EMA). Within this setup, the Semantics Aligner (SA) sits in the decoder of the student-teacher framework and fine-tunes the relationship between object queries and the encoded image features, yielding more accurate detection of tables across varied documents.
  • Figure 3: Visual analysis of our semi-supervised approach. Blue represents ground truth and red denotes our prediction results using 10% of the labels on the PubLayNet dataset.
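
The EMA update mentioned in the Figure 2 caption, where the Student's weights are folded into the Teacher's, can be sketched as follows. This is an illustrative sketch only: the `ema_update` helper, the dictionary-of-floats parameter representation, and the decay value are assumptions, and the source does not give the decay the authors use.

```python
from typing import Dict


def ema_update(
    teacher: Dict[str, float],
    student: Dict[str, float],
    decay: float = 0.999,  # assumed value, not taken from the paper
) -> Dict[str, float]:
    """Return teacher weights moved towards the student weights.

    Each parameter becomes decay * teacher + (1 - decay) * student.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Hypothetical single-parameter models, scalars standing in for tensors.
teacher_weights = {"w": 1.0}
student_weights = {"w": 0.0}
teacher_weights = ema_update(teacher_weights, student_weights, decay=0.9)
```

A decay close to 1 makes the Teacher change slowly, which is the general reason EMA teachers tend to give more stable pseudo-labels than the Student itself.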