LORE++: Logical Location Regression Network for Table Structure Recognition with Pre-training

Rujiao Long, Hangdi Xing, Zhibo Yang, Qi Zheng, Zhi Yu, Cong Yao, Fei Huang

TL;DR

This work reframes TSR as a joint spatial and logical location regression problem, introducing LORE to predict both the cell corners and the 2D logical grid coordinates directly from table images. By employing cascading regressors and inter-cell/intra-cell supervisions, LORE captures dependencies among logical locations and enables straightforward transformations to adjacency and markup representations without heavy post-processing. The authors further extend the approach with LORE++, a pre-trained variant using a Masked Autoencoder and a Logical Distance Prediction task, which significantly improves accuracy, generalization, and data efficiency on diverse benchmarks. Together, these results demonstrate that logical location regression is a competitive and scalable paradigm for TSR, with practical impact on robust table understanding across layouts and modalities.

Abstract

Table structure recognition (TSR) aims at extracting tables in images into machine-understandable formats. Recent methods solve this problem by predicting the adjacency relations of detected cell boxes or by learning to generate the corresponding markup sequences directly from the table images. However, existing approaches either rely on additional heuristic rules to recover the table structures or struggle to capture long-range dependencies within tables, resulting in increased complexity. In this paper, we propose an alternative paradigm. We model TSR as a logical location regression problem and propose a new TSR framework called LORE, standing for LOgical location REgression network, which for the first time regresses the logical locations as well as the spatial locations of table cells in a unified network. Our proposed LORE is conceptually simpler, easier to train, and more accurate than other paradigms of TSR. Moreover, inspired by the success of pre-trained models on a number of computer vision and natural language processing tasks, we propose two pre-training tasks to enrich the spatial and logical representations at the feature level of LORE, resulting in an upgraded version called LORE++. Pre-training gives LORE++ substantial gains in accuracy, generalization, and few-shot capability over its predecessor. Experiments on standard benchmarks against methods of previous paradigms demonstrate the superiority of LORE++, which highlights the potential of the logical location regression paradigm for TSR.
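To make the logical-location representation concrete, the following is a minimal sketch of how per-cell logical indices could be turned into HTML markup once they have been predicted. It is not the authors' code: the `Cell` type, the `cells_to_html` function, and the 0-based inclusive indexing convention are assumptions made for illustration only.

```python
import html
from typing import List, NamedTuple


class Cell(NamedTuple):
    """A table cell described only by its logical location (hypothetical type)."""
    sr: int  # starting row, 0-based, inclusive (assumed convention)
    er: int  # ending row, inclusive
    sc: int  # starting column, inclusive
    ec: int  # ending column, inclusive
    text: str = ""


def cells_to_html(cells: List[Cell]) -> str:
    """Render cells as an HTML table, deriving spans from logical indices."""
    n_rows = max(c.er for c in cells) + 1
    rows: List[List[Cell]] = [[] for _ in range(n_rows)]
    # A spanning cell is emitted only in the row where it starts.
    for c in cells:
        rows[c.sr].append(c)

    parts = ["<table>"]
    for row in rows:
        parts.append("<tr>")
        for c in sorted(row, key=lambda cell: cell.sc):
            attrs = ""
            if c.er > c.sr:
                attrs += f' rowspan="{c.er - c.sr + 1}"'
            if c.ec > c.sc:
                attrs += f' colspan="{c.ec - c.sc + 1}"'
            parts.append(f"<td{attrs}>{html.escape(c.text)}</td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


# A header spanning two columns above two ordinary cells.
example_cells = [
    Cell(0, 0, 0, 1, "Header"),
    Cell(1, 1, 0, 0, "a"),
    Cell(1, 1, 1, 1, "b"),
]
markup = cells_to_html(example_cells)
```

Because each cell carries its own row and column span, no neighbour search or rule-based grouping is needed in this sketch; the spans follow directly from the differences `er - sr` and `ec - sc`.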

Paper Structure (42 sections, 17 equations, 9 figures, 9 tables, 1 algorithm)

This paper contains 42 sections, 17 equations, 9 figures, 9 tables, 1 algorithm.

Figures (9)

  • Figure 1: TSR paradigms using different table-structure representations. Here, $sr$, $er$, $sc$, $ec$ refer to the starting-row, ending-row, starting-column, and ending-column respectively.
  • Figure 2: A comparison between the usual regression (left) and the logical location regression (right). The typical regression hypothesis is that different targets are independently distributed. However, dependencies exist between logical indices, e.g., the logical location of the cell '70.6' is constrained by those of the four surrounding cells.
  • Figure 3: An illustration of LORE. It first locates table cells in the input image by key point segmentation. Then the logical locations are predicted along with the spatial locations. The cascading regressors and the inter-cell and intra-cell supervisions are employed to better model the dependencies and constraints between logical locations.
  • Figure 4: (Left) An illustration of the pre-training and fine-tuning framework of LORE++. The model is jointly pre-trained on the MAE task and the Logical Distance Prediction (LDP) task, which correspond respectively to the spatial and logical location prediction tasks in the fine-tuning stage. (Right) A comparison of the attention masks used in the encoder for MAE and LDP, which let the model update both the unmasked and masked patches in a single forward pass. The embeddings of masked patches are replaced by the mask token when fed into the spatial decoder.
  • Figure 5: Demonstration of the Logical Distance Prediction task with text region boxes (Red), grid columns (Green), and grid rows (Blue). In this example, the logical distances are both 2 in terms of row and column.
  • ...and 4 more figures
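
The logical distance described in the Figure 5 caption can be illustrated with a small sketch. The exact formulation of the LDP task is not given in this excerpt, so the code below assumes one plausible reading: the row (or column) distance between two text regions is the number of grid lines separating their centers. The function name `logical_distance`, the box format `(x0, y0, x1, y1)`, and the example coordinates are all hypothetical.

```python
from bisect import bisect_right
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed format


def logical_distance(
    box_a: Box,
    box_b: Box,
    row_lines: Sequence[float],
    col_lines: Sequence[float],
) -> Tuple[int, int]:
    """Return (row distance, column distance) between two text boxes.

    row_lines: sorted y-coordinates of interior horizontal grid lines.
    col_lines: sorted x-coordinates of interior vertical grid lines.
    Assumption: distance = number of grid lines between the box centers.
    """
    def center(box: Box) -> Tuple[float, float]:
        return (box[0] + box[2]) / 2, (box[1] + box[3]) / 2

    (ax, ay), (bx, by) = center(box_a), center(box_b)
    # bisect_right gives the index of the grid band containing each center.
    d_row = abs(bisect_right(row_lines, ay) - bisect_right(row_lines, by))
    d_col = abs(bisect_right(col_lines, ax) - bisect_right(col_lines, bx))
    return d_row, d_col


# Made-up grid: three interior row lines and three interior column lines.
rows_y = [10.0, 20.0, 30.0]
cols_x = [50.0, 100.0, 150.0]
top_left = (5.0, 2.0, 45.0, 8.0)         # center falls in row 0, column 0
lower_right = (105.0, 22.0, 145.0, 28.0)  # center falls in row 2, column 2
distance = logical_distance(top_left, lower_right, rows_y, cols_x)
```

With these made-up coordinates the two boxes are two rows and two columns apart, the same kind of outcome the caption describes for its example.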