ClusterTabNet: Supervised clustering method for table detection and table structure recognition

Marek Polewczyk, Marco Spinaci

TL;DR

ClusterTabNet reframes table detection and table structure recognition as supervised clustering over OCR word boxes, using a transformer encoder to predict an adjacency matrix of size $n \times n$, where $n$ is the number of words. It outputs distinct adjacency heads for tables, rows, columns, and headers, so that words can be grouped into table structures via connected components and post-processing. The approach is lightweight (approximately $5\times 10^6$ parameters) and optionally benefits from image patches, achieving accuracy comparable to or better than DETR and Faster R-CNN on PubTables-1M, PubTabNet, FinTabNet, and ICDAR-2019 while avoiding heavy image-based models. By leveraging OCR output and a transitive clustering framework, the method is robust to rotation and document layout diversity and provides a unified, end-to-end adjacency-based representation for tables and their components.
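The grouping step described above can be illustrated with a minimal sketch. It assumes a predicted adjacency matrix of pairwise scores in $[0, 1]$ for a single head (for example, "same row"), and recovers clusters as connected components of the thresholded graph using union-find. The function name `cluster_words`, the default threshold of 0.5, and the averaging of the two directions are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List


def cluster_words(adjacency: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Group word indices into connected components of the thresholded graph.

    `adjacency[i][j]` is a (hypothetical) predicted score that words i and j
    belong to the same cluster. The threshold value is an assumption.
    """
    n = len(adjacency)
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Assumption: symmetrize by averaging the two directed scores.
            score = 0.5 * (adjacency[i][j] + adjacency[j][i])
            if score >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    clusters: Dict[int, List[int]] = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy "same row" scores for four words: words 0 and 1 share a row, as do 2 and 3.
row_scores = [
    [1.0, 0.9, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.1, 1.0, 0.7],
    [0.0, 0.2, 0.6, 1.0],
]
print(cluster_words(row_scores))  # [[0, 1], [2, 3]]
```

Because connected components are transitive, two words end up in the same cluster whenever a chain of above-threshold links joins them, even if their direct score is low. This is one way to read the "transitive clustering" idea; the actual post-processing used by the authors may differ.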

Abstract

We present a novel deep-learning-based method to cluster words in documents which we apply to detect and recognize tables given the OCR output. We interpret table structure bottom-up as a graph of relations between pairs of words (belonging to the same row, column, header, as well as to the same table) and use a transformer encoder model to predict its adjacency matrix. We demonstrate the performance of our method on the PubTables-1M dataset as well as PubTabNet and FinTabNet datasets. Compared to the current state-of-the-art detection methods such as DETR and Faster R-CNN, our method achieves similar or better accuracy, while requiring a significantly smaller model.
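To make the "graph of relations between pairs of words" concrete, the following sketch builds a binary target adjacency matrix for one relation type from per-word cluster labels. The label encoding (an integer id per word, `None` for words outside any table) and the function name `target_adjacency` are assumptions made for illustration; the paper's actual training targets may be encoded differently.

```python
from typing import List, Optional, Sequence


def target_adjacency(labels: Sequence[Optional[int]]) -> List[List[int]]:
    """Return an n x n 0/1 matrix with entry 1 where two words share a label.

    Assumption: `labels[i]` is the id of the row (or column, header, table)
    containing word i, unique across the page, or None if the word belongs
    to no such group. Unlabeled words are linked to nothing, including
    themselves.
    """
    n = len(labels)
    return [
        [int(labels[i] is not None and labels[i] == labels[j]) for j in range(n)]
        for i in range(n)
    ]


# Five words: a 2 x 2 table followed by one word of body text.
row_ids = [0, 0, 1, 1, None]
col_ids = [0, 1, 0, 1, None]

same_row = target_adjacency(row_ids)
same_col = target_adjacency(col_ids)

for line in same_row:
    print(line)
```

One matrix of this kind per relation (table, row, column, header) would give the set of targets that the adjacency heads are trained to predict under these assumptions.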

Paper Structure

This paper contains 24 sections, 3 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Schematic representation of the clustering network architecture with its input and output, as well as the loss applied during training.
  • Figure 2: Dice score and average precision for various choices of the hard threshold.
  • Figure 3: Examples from the test set showcasing some correct predictions.
  • Figure 5: Examples showcasing table detection and recognition for challenging documents from the Tobacco dataset. Several rows are wrongly skipped or joined, but the overall performance is satisfactory, considering that the model was trained only on images from "clean" modern documents.