TableBank: A Benchmark Dataset for Table Detection and Recognition
Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, Zhoujun Li
TL;DR
TableBank is a large-scale, weakly supervised dataset for image-based table detection and recognition, built from Word and LaTeX documents and containing 417K labeled tables across diverse domains. It establishes strong baselines using Faster R-CNN for table detection and an image-to-markup encoder-decoder for table structure recognition. Models perform best on the domain they were trained on, and training on mixed-domain data improves cross-domain generalization. The results highlight the need for large, varied training data for robust table analysis and show that deep learning methods outperform traditional OCR-based tools on this task. The authors publicly release TableBank and plan to extend it to additional domains and finer-grained document components to further advance table analysis research.
Abstract
We present TableBank, a new image-based table detection and recognition dataset built with novel weak supervision from Word and LaTeX documents on the internet. Existing research on image-based table detection and recognition usually fine-tunes pre-trained models on out-of-domain data with a few thousand human-labeled examples, an approach that generalizes poorly to real-world applications. Using TableBank, which contains 417K high-quality labeled tables, we build several strong baselines with state-of-the-art deep neural network models. We make TableBank publicly available and hope it will empower more deep learning approaches to the table detection and recognition task. The dataset and models are available at https://github.com/doc-analysis/TableBank.
