Table of Contents
Fetching ...

TableBank: A Benchmark Dataset for Table Detection and Recognition

Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, Zhoujun Li

TL;DR

TableBank introduces a large-scale, weakly supervised dataset for image-based table detection and recognition, derived from Word and LaTeX documents to achieve 417k labeled tables across diverse domains. It establishes strong baselines using Faster R-CNN for table detection and an image-to-markup encoder-decoder for table structure recognition, demonstrating domain-specific performance and improved cross-domain generalization when training on mixed-domain data. The results highlight the necessity of large, varied training data for robust table analysis and show deep learning methods outperform traditional OCR-based tools on this task. The authors publicly release TableBank and plan to expand to additional domains and finer-grained document components to further advance table analysis research.

Abstract

We present TableBank, a new image-based table detection and recognition dataset built with novel weak supervision from Word and Latex documents on the internet. Existing research for image-based table detection and recognition usually fine-tunes pre-trained models on out-of-domain data with a few thousand human-labeled examples, which is difficult to generalize on real-world applications. With TableBank that contains 417K high quality labeled tables, we build several strong baselines using state-of-the-art models with deep neural networks. We make TableBank publicly available and hope it will empower more deep learning approaches in the table detection and recognition task. The dataset and models are available at \url{https://github.com/doc-analysis/TableBank}.

TableBank: A Benchmark Dataset for Table Detection and Recognition

TL;DR

TableBank introduces a large-scale, weakly supervised dataset for image-based table detection and recognition, derived from Word and LaTeX documents to achieve 417k labeled tables across diverse domains. It establishes strong baselines using Faster R-CNN for table detection and an image-to-markup encoder-decoder for table structure recognition, demonstrating domain-specific performance and improved cross-domain generalization when training on mixed-domain data. The results highlight the necessity of large, varied training data for robust table analysis and show deep learning methods outperform traditional OCR-based tools on this task. The authors publicly release TableBank and plan to expand to additional domains and finer-grained document components to further advance table analysis research.

Abstract

We present TableBank, a new image-based table detection and recognition dataset built with novel weak supervision from Word and Latex documents on the internet. Existing research for image-based table detection and recognition usually fine-tunes pre-trained models on out-of-domain data with a few thousand human-labeled examples, which is difficult to generalize on real-world applications. With TableBank that contains 417K high quality labeled tables, we build several strong baselines using state-of-the-art models with deep neural networks. We make TableBank publicly available and hope it will empower more deep learning approaches in the table detection and recognition task. The dataset and models are available at \url{https://github.com/doc-analysis/TableBank}.

Paper Structure

This paper contains 19 sections, 1 equation, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Tables in electronic documents on the web with different layouts and formats. Among them, Figure (a)(b)(c) come from the latex format documents while Figure (d) comes from a word format document
  • Figure 2: Data processing pipeline
  • Figure 3: The tables can be identified and labeled from "$<$w:tbl$>$" and "$<$/w:tbl$>$" tags in the Office XML code
  • Figure 4: Figure (a): Missed a table; Figure (b): Two tables were annotated as one; Figure (c): One of the annotations is incomplete; Figure (c): Some annotations are overlaped.
  • Figure 5: A Table-to-HTML example where '$<$cell_y$>$' denotes cells with content while '$<$cell_n$>$' represents cells without content
  • ...and 3 more figures