TABLET: Table Structure Recognition using Encoder-only Transformers
Qiyu Hou, Jun Wang
TL;DR
TABLET introduces a Split-Merge framework using encoder-only Transformers to tackle table structure recognition in large, dense tables. The split model performs horizontal and vertical line splitting via dual 1D Transformers on high-resolution feature streams, while the merge model uses RoIAlign-based grid-cell features and a Transformer with 2D positional embeddings to classify grid cells into OTSL tokens, producing HTML layouts. Extensive experiments on FinTabNet and PubTabNet show superior accuracy and competitive TEDS scores, with strong robustness against misalignment and much faster inference than autoregressive approaches. The approach is well-suited for industrial deployment due to high accuracy, reduced resolution loss, and fast processing speeds, even on large-scale business documents.
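To make the last step of the merge stage concrete, the following minimal sketch turns a grid of per-cell OTSL tokens into an HTML table skeleton. It is an illustration under stated assumptions, not the authors' implementation: the function name `otsl_to_html` is hypothetical, and the token meanings are assumed to follow the usual OTSL vocabulary ("C" starts a cell, "L" merges with the cell to the left, "U" merges with the cell above, "X" merges both ways). Cell text is left empty because only the structure is being reconstructed.

```python
def otsl_to_html(grid):
    """Convert a rectangular grid of OTSL tokens into an HTML table skeleton.

    Assumed token meanings (not defined in the summary above):
      "C" - a new cell starts here
      "L" - grid cell belongs to the cell on its left
      "U" - grid cell belongs to the cell above it
      "X" - grid cell belongs to a cell spanning both directions
    """
    n_rows = len(grid)
    n_cols = len(grid[0]) if grid else 0
    rows_html = []
    for r in range(n_rows):
        cells = []
        for c in range(n_cols):
            if grid[r][c] != "C":
                continue  # covered by a cell that starts elsewhere
            # Extend right while the neighbours are left-looking.
            colspan = 1
            while c + colspan < n_cols and grid[r][c + colspan] == "L":
                colspan += 1
            # Extend down while the neighbours are up-looking.
            rowspan = 1
            while r + rowspan < n_rows and grid[r + rowspan][c] == "U":
                rowspan += 1
            attrs = ""
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            cells.append(f"<td{attrs}></td>")
        rows_html.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>" + "".join(rows_html) + "</table>"


# Example: a 2x3 grid with one horizontally merged and one vertically merged cell.
example = [
    ["C", "L", "C"],
    ["C", "C", "U"],
]
html = otsl_to_html(example)
```

Because every grid cell receives exactly one token, this decoding is a single pass over the grid, with no autoregressive loop over output tokens.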
Abstract
To address the challenges of table structure recognition, we propose a novel Split-Merge-based top-down model optimized for large, densely populated tables. Our approach formulates row and column splitting as sequence labeling tasks, utilizing dual Transformer encoders to capture feature interactions. The merging process is framed as a grid cell classification task, leveraging an additional Transformer encoder to ensure accurate and coherent merging. By eliminating unstable bounding box predictions, our method reduces resolution loss and computational complexity, achieving high accuracy while maintaining fast processing speed. Extensive experiments on FinTabNet and PubTabNet demonstrate the superiority of our model over existing approaches, particularly in real-world applications. Our method offers a robust, scalable, and efficient solution for large-scale table recognition, making it well-suited for industrial deployment.
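As a rough illustration of treating row or column splitting as sequence labeling, the sketch below assumes that the split model emits one binary label per pixel row (or pixel column), with 1 marking a separator region and 0 marking content. The decoding rule shown here, which places one split line at the center of each run of 1s, is an assumption chosen for clarity and is not taken from the paper; the name `split_positions` is hypothetical.

```python
def split_positions(labels):
    """Return the center index of each run of 1s in a 0/1 label sequence.

    Each run of consecutive 1s is treated as one separator region, and a
    single split line is placed at its (floor) midpoint. Runs that touch
    the start or end of the sequence are kept as-is in this sketch.
    """
    positions = []
    start = None  # index where the current run of 1s began, if any
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i
        elif label != 1 and start is not None:
            positions.append((start + i - 1) // 2)
            start = None
    if start is not None:  # run extends to the end of the sequence
        positions.append((start + len(labels) - 1) // 2)
    return positions


# Example: separator regions at indices 2..4 and 7.
row_labels = [0, 0, 1, 1, 1, 0, 0, 1, 0]
row_splits = split_positions(row_labels)  # → [3, 7]
```

Running the same function on the per-column labels yields the vertical split lines; together the two lists define the grid whose cells are then classified by the merge stage.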
