End-to-End Semi-Supervised approach with Modulated Object Queries for Table Detection in Documents

Iqraa Ehsan, Tahira Shehzadi, Didier Stricker, Muhammad Zeshan Afzal

TL;DR

The paper addresses data-efficient table detection under limited annotations by introducing an end-to-end DETR-based semi-supervised detector that employs dual query assignment (one-to-one ${\mathcal{L}_{o2o}}$ and one-to-many ${\mathcal{L}_{o2m}}$) and a teacher–student EMA framework to generate high-quality pseudo-labels. It formalizes two query sets and associated losses, leveraging ground-truth augmentation and Hungarian matching to maintain accuracy while eliminating NMS during inference. On PubLayNet and TableBank with as little as 30% labeled data, the approach achieves state-of-the-art mAPs (approximately $95.7\%$ on TableBank-word and $97.9\%$ on PubLayNet), outperforming prior semi-supervised methods by about 7–8 points. The method reduces labeling costs and increases training efficiency, with promising potential for extending to table-structure recognition in document analysis.
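The difference between the two assignment schemes can be illustrated with a small sketch. It assumes a precomputed query-to-ground-truth cost matrix and models the one-to-many branch as ground-truth augmentation: each ground-truth box is repeated `k` times before matching. The function names, the toy costs, and the value of `k` are illustrative assumptions, not the authors' implementation, and a brute-force search stands in for the Hungarian algorithm (it finds the same minimum-cost assignment, but only scales to tiny inputs).

```python
from itertools import permutations


def one_to_one_match(cost):
    """Return {gt_index: query_index} minimizing the total matching cost.

    cost[q][g] is the cost of assigning query q to ground truth g.
    Brute force is used here as a stand-in for Hungarian matching.
    """
    num_queries = len(cost)
    num_gt = len(cost[0]) if cost else 0
    best_total, best = float("inf"), ()
    # Try every way of giving each ground truth its own distinct query.
    for queries in permutations(range(num_queries), num_gt):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best_total, best = total, queries
    return {g: q for g, q in enumerate(best)}


def one_to_many_match(cost, k=2):
    """Return {gt_index: [query_index, ...]} with k queries per ground truth.

    Each ground-truth column is repeated k times (ground-truth augmentation),
    then the one-to-one matcher runs on the augmented matrix.
    Requires at least k * num_gt queries.
    """
    num_gt = len(cost[0]) if cost else 0
    augmented = [[row[g % num_gt] for g in range(num_gt * k)] for row in cost]
    flat = one_to_one_match(augmented)
    groups = {g: [] for g in range(num_gt)}
    for aug_g, q in flat.items():
        groups[aug_g % num_gt].append(q)
    return {g: sorted(qs) for g, qs in groups.items()}


# Toy example (hypothetical costs): 4 queries, 2 ground-truth tables.
toy_cost = [
    [0.10, 0.90],
    [0.20, 0.80],
    [0.90, 0.15],
    [0.70, 0.30],
]
o2o = one_to_one_match(toy_cost)
o2m = one_to_many_match(toy_cost, k=2)
```

In this toy case the one-to-one branch keeps a single query per table, which is what makes NMS unnecessary at inference, while the one-to-many branch supervises two queries per table and so provides more positive training signal.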

Abstract

Table detection, a pivotal task in document analysis, aims to precisely recognize and locate tables within document images. Although deep learning has shown remarkable progress in this realm, it typically requires an extensive dataset of labeled data for proficient training. Current CNN-based semi-supervised table detection approaches use the anchor generation process and Non-Maximum Suppression (NMS) in their detection process, limiting training efficiency. Meanwhile, transformer-based semi-supervised techniques adopted a one-to-one match strategy that provides noisy pseudo-labels, limiting overall efficiency. This study presents an innovative transformer-based semi-supervised table detector. It improves the quality of pseudo-labels through a novel matching strategy combining one-to-one and one-to-many assignment techniques. This approach significantly enhances training efficiency during the early stages, ensuring superior pseudo-labels for further training. Our semi-supervised approach is comprehensively evaluated on benchmark datasets, including PubLayNet, ICDAR-19, and TableBank. It achieves new state-of-the-art results, with mAPs of 95.7% and 97.9% on TableBank (word) and PubLayNet, respectively, using 30% labeled data, a 7.4- and 7.6-point improvement over the previous semi-supervised table detection approach. The results clearly show the superiority of our semi-supervised approach, surpassing all existing state-of-the-art methods by substantial margins. This research represents a significant advancement in semi-supervised table detection methods, offering a more efficient and accurate solution for practical document analysis tasks.

Paper Structure (18 sections, 9 equations, 8 figures, 12 tables)

Figures (8)

  • Figure 1: Performance comparison of our approach with previous supervised and semi-supervised table detection approaches on the TableBank-Both and PubLayNet datasets. In the semi-supervised setting, we perform experiments with 30$\%$ labeled data. The semi-supervised deformable transformer \cite{shehzadi_semi-detr_table1} is referred to as Def. Semi. For an extensive summary of the results, please refer to Table \ref{tab:tableno8}.
  • Figure 2: Our approach contains two modules: the student module and the teacher module. The decoders of both modules utilize one-to-one and one-to-many assignment strategies. The decoder of the teacher module employs the one-to-many assignment strategy to generate high-quality pseudo-labels and improve performance with limited data. The one-to-many assignment strategy in the student module provides high-quality predictions, while the one-to-one matching strategy removes duplicates.
  • Figure 3: Our semi-supervised approach employs one-to-many and one-to-one matching strategies in a semi-supervised setting and incorporates both labeled and unlabeled data during training. The framework consists of two modules: a student module and a teacher module. The teacher module takes unlabeled images after weak augmentation and generates their pseudo-labels. The student module handles both labeled and unlabeled images, applying strong augmentation specifically to the unlabeled images. During training, the student module continuously updates the teacher module through an exponential moving average (EMA).
  • Figure 4: Visualization of the predictions of our approach on different datasets. It highlights that including both one-to-one and one-to-many matching strategies enhances the model's accuracy. The ground truth is depicted by the blue boxes, while the green boxes display the results obtained through our approach.
  • Figure 5: Comparative performance analysis of our approach on the TableBank, PubLayNet, and ICDAR datasets. Experiments were conducted with a ResNet-50 backbone and three distinct data splits: 10$\%$, 30$\%$, and 50$\%$ labeled data. The results are presented as follows: (a) Mean Average Precision (mAP) within the IoU threshold range of 50$\%$ to 95$\%$, and (b) Average Recall for large objects ($AR_{L}$) within the IoU threshold range of 50$\%$ to 95$\%$.
  • ...and 3 more figures
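
The EMA update of the teacher described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' code: model weights are represented as plain dictionaries of floats, and the function name `ema_update` and the decay value are assumptions made for the example.

```python
def ema_update(teacher, student, decay=0.999):
    """Update teacher weights in place as an exponential moving average.

    teacher[name] <- decay * teacher[name] + (1 - decay) * student[name]
    Both arguments map parameter names to float values (an assumption made
    to keep the sketch free of any deep learning framework).
    """
    for name, student_value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student_value
    return teacher


# Hypothetical weights: the teacher moves slowly toward the student.
teacher_weights = {"w": 1.0}
student_weights = {"w": 0.0}
ema_update(teacher_weights, student_weights, decay=0.9)  # w becomes about 0.9
ema_update(teacher_weights, student_weights, decay=0.9)  # w becomes about 0.81
```

A decay close to 1 keeps the teacher stable across training steps, which is the usual motivation for generating pseudo-labels from the teacher rather than from the rapidly changing student.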