Rethinking Detection Based Table Structure Recognition for Visually Rich Document Images

Bin Xiao; Murat Simsek; Burak Kantarci; Ala Abu Alkheir

TL;DR

The paper scrutinizes detection-based Table Structure Recognition (TSR) and identifies misalignments between detection objectives and TSR structure metrics, as well as gaps in the problem formulation. It proposes TSRDet, a tailored Cascade R-CNN pipeline that converts multi-label TSR into a single-label setting, expands region proposals and aspect ratios, and introduces a Spatial Attention Module to capture long-range context while preserving local details. Empirical results show competitive or superior performance across SciTSR, FinTabNet, PubTables-1M, and PubTabNet-derived evaluations, with notable gains in structure-only $TEDS$ and reliable COCO metrics. The findings demonstrate that, with careful problem framing and simple architectural enhancements, a detection-based TSR approach can outperform or match graph-based and image-to-sequence methods, offering practical guidance for future TSR model design.

Abstract

Table Structure Recognition (TSR) is a widely discussed task that aims to transform unstructured table images into structured formats, such as HTML sequences, so that text-only models, such as ChatGPT, can further process these tables. One type of solution uses detection models to detect table components, such as columns and rows, and then applies a rule-based post-processing method to convert the detection results into HTML sequences. However, existing detection-based models usually do not perform as well as other types of solutions on cell-level TSR metrics, such as TEDS, and the underlying reasons limiting their performance on the TSR task are not well explored. Therefore, we comprehensively revisit existing detection-based models and explore the reasons hindering their performance, including the improper problem definition, the mismatch between detection and TSR metrics, the characteristics of detection models, and the impact of local and long-range feature extraction. Based on our analysis and findings, we apply simple methods to tailor a typical two-stage detection model, Cascade R-CNN, for the TSR task. The experimental results show that the tailored Cascade R-CNN based model improves on the base Cascade R-CNN model by 16.35\% on the FinTabNet dataset regarding the structure-only TEDS, outperforming other types of state-of-the-art methods. This demonstrates that our findings can serve as a guideline for improving detection-based TSR models and that a purely detection-based solution is competitive with other types of solutions, such as graph-based and image-to-sequence solutions.
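As a rough illustration of the detect-then-post-process pipeline described in the abstract, the following minimal sketch turns detected row and column boxes into a plain HTML grid. It is not the authors' post-processing method: the function names `grid_cells` and `cells_to_html`, the `(x1, y1, x2, y2)` box format, and the decision to ignore spanning cells and headers are all assumptions made for this example.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) in image pixels.
Box = Tuple[float, float, float, float]


def grid_cells(rows: List[Box], cols: List[Box]) -> List[List[Box]]:
    """Intersect every detected row with every detected column.

    Rows are ordered top to bottom and columns left to right, so the
    result is a row-major grid of cell boxes. Spanning cells are ignored.
    """
    rows_sorted = sorted(rows, key=lambda b: b[1])  # sort by top edge
    cols_sorted = sorted(cols, key=lambda b: b[0])  # sort by left edge
    return [
        [(c[0], r[1], c[2], r[3]) for c in cols_sorted]
        for r in rows_sorted
    ]


def cells_to_html(cells: List[List[Box]]) -> str:
    """Emit a structure-only HTML sequence (empty cells, no text)."""
    body = "".join("<tr>" + "<td></td>" * len(row) + "</tr>" for row in cells)
    return "<table>" + body + "</table>"
```

A structure-only sequence of this kind is the sort of output that a structure-only TEDS score compares against the ground truth; a realistic pipeline would also need to handle spanning cells and to assign recognized text to each cell.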

Paper Structure (26 sections, 12 equations, 10 figures, 9 tables)

Introduction
Related Work
Object Detection Models
Table Structure Recognition
Rethinking Detection-based TSR models
Preliminaries
Cascade R-CNN
Sparse R-CNN
Rethinking Problem Formulations
Revisiting Region Proposal Generation
Rethinking Detection and TSR Metrics
Rethinking Feature Extraction
Proposed Method
Proposed Problem Formulation
Tuning Parameters of RPN
...and 11 more sections

Figures (10)

Figure 1: Overall architecture of Cascade R-CNN.
Figure 2: Overall architecture of Sparse R-CNN.
Figure 3: Different problem formulations for detection-based TSR.
Figure 4: Statistics of aspect ratio values of the COCO and FinTabNet training sets. When an aspect ratio is less than 1, its multiplicative inverse is used for counting.
Figure 5: A sample from the FinTabNet dataset with ground truth boxes larger than the minimum bounding boxes for table structure. We only show the annotations of Columns for simplicity.
...and 5 more figures
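The Figure 4 caption appears to describe folding aspect ratios below 1 onto their multiplicative inverse before counting them. A minimal sketch under that reading follows; the helper names, the `(width, height)` input format, and the rounding of ratios to integer bins are assumptions for this example and are not taken from the paper.

```python
from collections import Counter
from typing import Iterable, Tuple


def folded_aspect_ratio(width: float, height: float) -> float:
    """Return width/height, or its inverse when the ratio is below 1."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = width / height
    return ratio if ratio >= 1 else 1 / ratio


def aspect_ratio_histogram(sizes: Iterable[Tuple[float, float]]) -> Counter:
    """Count folded aspect ratios, rounded to the nearest integer bin."""
    return Counter(round(folded_aspect_ratio(w, h)) for w, h in sizes)
```

Under this folding, a tall narrow column box and a wide flat row box with the same elongation land in the same bin, which makes it easier to see how far table components depart from the roughly square objects common in natural images.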
