A large-scale dataset for end-to-end table recognition in the wild

Fan Yang, Lei Hu, Xinwu Liu, Shuangping Huang, Zhenghui Gu

TL;DR

TabRecSet introduces a large-scale, bilingual table recognition dataset for end-to-end TR under wild conditions, featuring polygon-based spatial annotations for irregular tables and complete TD/TSR/TCR annotations. It includes a visual annotation tool TableMe, an auto-annotation TSR module, and border-incomplete table generation to boost diversity. The paper details data collection, cleaning, annotation workflows, and cross-checking/proofreading to ensure data quality, and demonstrates usability via multi-task evaluation across TSR, TCR, and TD baselines, showing meaningful gains when fine-tuning models on TabRecSet. The dataset significantly broadens scenario diversity and annotation flexibility, enabling robust evaluation of end-to-end table recognition in real-world images and informing future model development.

Abstract

Table recognition (TR) is one of the research hotspots in pattern recognition; it aims to extract information from tables in an image. Common table recognition tasks include table detection (TD), table structure recognition (TSR) and table content recognition (TCR). TD locates tables in the image, TCR recognizes text content, and TSR recognizes spatial and logical structure. Currently, end-to-end TR in real scenarios, which accomplishes the three sub-tasks simultaneously, is still an unexplored research area. One major factor that inhibits researchers is the lack of a benchmark dataset. To this end, we propose a new large-scale dataset named Table Recognition Set (TabRecSet) with diverse table forms sourced from multiple scenarios in the wild, providing complete annotation dedicated to end-to-end TR research. It is the largest and first bilingual dataset for end-to-end TR, with 38.1K tables, of which 20.4K are in English and 17.7K are in Chinese. The samples have diverse forms, such as border-complete and border-incomplete tables, and regular and irregular tables (rotated, distorted, etc.). The scenarios are varied and in the wild, ranging from scanned to camera-taken images, documents to Excel tables, and educational test papers to financial invoices. The annotations are complete, consisting of the table body spatial annotation, the cell spatial and logical annotation, and the text content for TD, TSR and TCR, respectively. The spatial annotation uses a polygon instead of the bounding box or quadrilateral adopted by most datasets. The polygon spatial annotation is more suitable for irregular tables, which are common in wild scenarios. Additionally, we propose a visualized and interactive annotation tool named TableMe to improve the efficiency and quality of table annotation.
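
To make the polygon-versus-bounding-box point concrete, the following is a minimal sketch, not the actual TabRecSet annotation format. The `CellAnnotation` record and its field names (`polygon`, `row_span`, `col_span`, `text`) are hypothetical, chosen only to mirror the three annotation kinds named in the abstract (spatial, logical, content). The sketch compares the area of a rotated cell's polygon with the area of its axis-aligned bounding box.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class CellAnnotation:
    # Hypothetical record; the real TabRecSet field names may differ.
    polygon: List[Point]        # spatial annotation: cell outline vertices, in order
    row_span: Tuple[int, int]   # logical annotation: (start_row, end_row), inclusive
    col_span: Tuple[int, int]   # logical annotation: (start_col, end_col), inclusive
    text: str                   # content annotation


def polygon_area(points: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def bounding_box_area(points: List[Point]) -> float:
    """Area of the axis-aligned bounding box enclosing the polygon."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# A square cell rotated by 45 degrees, as might occur in a camera-taken image.
cell = CellAnnotation(
    polygon=[(2.0, 0.0), (4.0, 2.0), (2.0, 4.0), (0.0, 2.0)],
    row_span=(0, 0),
    col_span=(1, 1),
    text="Total",
)

poly = polygon_area(cell.polygon)
box = bounding_box_area(cell.polygon)
print(f"polygon area = {poly}, bounding-box area = {box}, ratio = {poly / box}")
```

In this toy case the bounding box covers twice the area of the cell itself, so half of the box belongs to neighbouring cells or background. That kind of overlap is what a polygon annotation avoids for rotated or distorted tables.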

Paper Structure (3 sections, 11 figures, 6 tables)

This paper contains 3 sections, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Some representative samples in TabRecSet. The scenarios include document images, food ingredient forms, Excel tables and invoice tables. Because of page distortions or camera views, most tables are irregular, i.e., with rotations, inclinations, concave/convex/wrinkle distortions, etc. Some special table forms are also shown, e.g., nested tables, under- and over-exposed tables, border-incomplete tables, tables with handwritten contents and hand-drawn tables.
  • Figure 2: The creation flow chart of TabRecSet. The data collection step gathers raw image samples and outputs a Raw Data Pool, which stores candidate data samples. The data cleaning step generates clean samples from the Raw Data Pool and gathers them into a Clean Dataset. In the data annotation step, we use TableMe to annotate the clean samples and save the annotations in the TabRecSet annotation format. This step is aided by several auto-annotation algorithms to improve efficiency. The border-incomplete table generation step aims to enlarge the scale of TabRecSet using our proposed three-line table generation algorithm.
  • Figure 3: An intuitive illustration of the data annotation step shown in Figure 2. Please zoom in for details.
  • Figure 4: Three annotation instances in the cell-wise annotation format.
  • Figure 5: The main interface of TableMe. Please zoom in for details.
  • ...and 6 more figures