Synthesizing Realistic Data for Table Recognition

Qiyu Hou, Jun Wang, Meixuan Qiao, Lujun Tian

TL;DR

The paper addresses the scarcity and domain-specific challenges of real-world table annotations by introducing a structure- and content-guided synthesis framework that leverages actual target-domain tables to generate realistic annotated data. It constructs the first large-scale dataset and benchmark for Chinese financial announcements and augments the English FinTabNet with more complex spanning-cell tables, validating the approach with end-to-end table recognition models. Key contributions include a data-driven analysis of real table distributions, a style-profile-based transformation pipeline, and an image-based rendering workflow that outputs model-ready annotations. The results demonstrate practical improvements in recognizing complex tables and support broader deployment of domain-tailored synthetic data to boost cross-domain generalization.

Abstract

To overcome the limitations and challenges of current automatic table data annotation methods and random table data synthesis approaches, we propose a novel method for synthesizing annotation data specifically designed for table recognition. This method utilizes the structure and content of existing complex tables, facilitating the efficient creation of tables that closely replicate the authentic styles found in the target domain. By leveraging the actual structure and content of tables from Chinese financial announcements, we have developed the first extensive table annotation dataset in this domain. We used this dataset to train several recent deep learning-based end-to-end table recognition models. Additionally, we have established the inaugural benchmark for real-world complex tables in the Chinese financial announcement domain, using it to assess the performance of models trained on our synthetic data, thereby effectively validating our method's practicality and effectiveness. Furthermore, we applied our synthesis method to augment the FinTabNet dataset, extracted from English financial announcements, by increasing the proportion of tables with multiple spanning cells to introduce greater complexity. Our experiments show that models trained on this augmented dataset achieve comprehensive improvements in performance, especially in the recognition of tables with multiple spanning cells.
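
The core idea in the abstract, keeping a real table's structure and content while varying its visual style, can be illustrated with a minimal sketch. This is not the authors' pipeline: the `STYLE_PROFILES` dictionary, the `restyle` and `count_spanning_cells` helpers, and the sample table are all hypothetical names and data introduced here for illustration. The sketch assumes tables are available as HTML and treats a cell as "spanning" when its `rowspan` or `colspan` exceeds 1.

```python
from html.parser import HTMLParser

# Hypothetical style profiles (assumption): the paper's real profiles are not shown here.
STYLE_PROFILES = {
    "bordered": "table{border-collapse:collapse} td,th{border:1px solid #000;padding:4px}",
    "borderless": "table{border-collapse:collapse} td,th{border:none;padding:4px}",
}


class SpanCounter(HTMLParser):
    """Counts cells whose rowspan or colspan is greater than 1."""

    def __init__(self):
        super().__init__()
        self.spanning_cells = 0

    def handle_starttag(self, tag, attrs):
        if tag not in ("td", "th"):
            return
        attr_map = dict(attrs)
        rowspan = int(attr_map.get("rowspan") or 1)
        colspan = int(attr_map.get("colspan") or 1)
        if rowspan > 1 or colspan > 1:
            self.spanning_cells += 1


def count_spanning_cells(table_html: str) -> int:
    """Return the number of spanning cells in an HTML table string."""
    counter = SpanCounter()
    counter.feed(table_html)
    return counter.spanning_cells


def restyle(table_html: str, profile: str) -> str:
    """Wrap the unchanged table markup in a page that applies the chosen style profile."""
    css = STYLE_PROFILES[profile]
    return f"<html><head><style>{css}</style></head><body>{table_html}</body></html>"


# Illustrative source table (made-up content), with two spanning header cells.
source_table = (
    "<table>"
    '<tr><th rowspan="2">Item</th><th colspan="2">Amount</th></tr>'
    "<tr><th>2022</th><th>2023</th></tr>"
    "<tr><td>Revenue</td><td>100</td><td>120</td></tr>"
    "</table>"
)

# One styled page per profile; structure and content stay identical across variants.
variants = {name: restyle(source_table, name) for name in STYLE_PROFILES}
```

Because the table markup is left untouched, the structure annotation derived from the source table remains valid for every styled variant; only the rendered appearance differs. A count such as `count_spanning_cells` could serve as a simple filter when selecting source tables with multiple spanning cells.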

Paper Structure

This paper contains 22 sections, 10 figures, 6 tables.

Figures (10)

  • Figure 1: The transformation of an original bordered table into two borderless tables and one bordered table with a different style.
  • Figure 2: Real examples of tables that have similar content but display vastly different visual styles.
  • Figure 3: System structure of the table annotation data synthesis method.
  • Figure 4: Illustration of table cells.
  • Figure 5: Illustration of text lines.
  • ...and 5 more figures