TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment

Chunxia Qin, Chenyu Liu, Pengcheng Xia, Jun Du, Baocai Yin, Bing Yin, Cong Liu

Abstract

Tables are pervasive in diverse documents, making table recognition (TR) a fundamental task in document analysis. Existing modular TR pipelines model table structure and content separately, leading to suboptimal integration and complex workflows. End-to-end approaches rely heavily on large-scale TR data and struggle in data-constrained scenarios. To address these issues, we propose TDATR (Table Detail-Aware Table Recognition), which improves end-to-end TR through table detail-aware learning and cell-level visual alignment. TDATR adopts a "perceive-then-fuse" strategy. The model first performs table detail-aware learning to jointly perceive table structure and content through multiple structure understanding and content recognition tasks designed under a language modeling paradigm. These tasks can naturally leverage document data from diverse scenarios to enhance model robustness. The model then integrates the implicit table details to generate structured HTML outputs, enabling more efficient TR modeling when trained with limited data. Furthermore, we design a structure-guided cell localization module integrated into the end-to-end TR framework, which efficiently locates cells and strengthens vision-language alignment. This module enhances the interpretability and accuracy of TR. TDATR achieves state-of-the-art or highly competitive performance on seven benchmarks without dataset-specific fine-tuning.
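To make the idea of "structured HTML outputs" paired with cell-level localization more concrete, the following is a minimal sketch of how a predicted table HTML string could be turned into a list of grid-positioned cells and then paired with one bounding box per cell. It is not the authors' implementation: the abstract does not specify the output schema, so the names `CellCollector`, `parse_cells`, and `attach_boxes`, the `(x1, y1, x2, y2)` box format, and the assumption that boxes arrive in the same order as the HTML cells are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect grid position, spans, and text for each <td>/<th> cell.

    Hypothetical helper for illustration only; not part of TDATR.
    """

    def __init__(self):
        super().__init__()
        self.cells = []          # finished cells, in HTML order
        self._row = -1           # index of the current <tr>
        self._occupied = set()   # (row, col) slots already covered by a span
        self._current = None     # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
        elif tag in ("td", "th"):
            a = dict(attrs)
            rowspan = int(a.get("rowspan") or 1)
            colspan = int(a.get("colspan") or 1)
            # Place the cell in the first free column of the current row.
            col = 0
            while (self._row, col) in self._occupied:
                col += 1
            # Mark every grid slot covered by this cell's spans.
            for r in range(self._row, self._row + rowspan):
                for c in range(col, col + colspan):
                    self._occupied.add((r, c))
            self._current = {
                "row": self._row, "col": col,
                "rowspan": rowspan, "colspan": colspan, "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.cells.append(self._current)
            self._current = None


def parse_cells(html):
    """Return the cells of a table HTML string in reading (HTML) order."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


def attach_boxes(cells, boxes):
    """Pair each cell with a box, assuming boxes follow the HTML cell order."""
    if len(cells) != len(boxes):
        raise ValueError("expected exactly one box per cell")
    return [dict(cell, box=box) for cell, box in zip(cells, boxes)]


html = ("<table><tr><td rowspan='2'>A</td><td>B</td></tr>"
        "<tr><td>C</td></tr></table>")
cells = parse_cells(html)
# Assumed box format: (x1, y1, x2, y2) in image pixels.
paired = attach_boxes(cells, [(0, 0, 10, 20), (10, 0, 20, 10), (10, 10, 20, 20)])
```

In this example the cell "C" lands in column 1 of the second row because column 0 is still covered by the row-spanning cell "A". A one-to-one pairing of this kind is what makes a recognition result inspectable at the cell level.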

Paper Structure (30 sections, 4 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Comparison of different table recognition paradigms. (a) Modular TR pipelines suffer from complex workflows and sub-optimization. (b) End-to-end TR models underperform in data-scarce scenarios due to weak detail perception. (c) Our "perceive-then-fuse" framework enhances structure and content awareness and unifies TR and cell localization for robust end-to-end TR.
  • Figure 2: (a) The architecture of the model. The model consists of a vision encoder, a language decoder, and a structure-guided cell localization module, which aggregates cell representations based on TR priors and refines cell boxes using multi-resolution visual features. (b) The perceive-then-fuse training strategy for end-to-end table recognition. In the table detail-aware learning phase, we design table structure understanding and content recognition tasks under a language modeling paradigm to enhance fine-grained perception. In the fusion phase, we fine-tune the model for table HTML parsing by aggregating the implicitly learned table details, while jointly training the cell localization module to strengthen cell-level visual alignment.
  • Figure 3: Visualization of cell localization on challenging tables, including borderless tables (b,h), complex-structured tables (e,d), long tables (b), and low-quality images (a,g,h).
  • Figure 4: The pipeline of unified multi-source table data processing. The pipeline normalizes heterogeneous table annotations from various sources into a unified representation for model training.
  • Figure 5: Illustration of table content recognition tasks. These tasks leverage diverse document data to enable text recognition, text localization, and reading-order understanding.
  • ...and 6 more figures