
InstructTable: Improving Table Structure Recognition Through Instructions

Boming Chen, Zining Wang, Zhentao Guo, Jianqiang Liu, Chen Duan, Yu Gu, Kai Zhou, Pengfei Yan

Abstract

Table structure recognition (TSR) holds widespread practical importance by parsing tabular images into structured representations, yet it encounters significant challenges when processing complex layouts involving merged or empty cells. Traditional visual-centric models rely exclusively on visual information and lack crucial semantic support, thereby impeding accurate structural recognition in complex scenarios. Vision-language models leverage contextual semantics to enhance comprehension; however, these approaches underemphasize the modeling of visual structural information. To address these limitations, this paper introduces InstructTable, an instruction-guided, multi-stage TSR training framework. Meticulously designed table instruction pre-training directs attention toward fine-grained structural patterns, enhancing comprehension of complex tables. Complementary TSR fine-tuning preserves robust visual information modeling, maintaining high-precision table parsing across diverse scenarios. Furthermore, we introduce Table Mix Expand (TME), an innovative template-free method for synthesizing large-scale authentic tabular data. Leveraging TME, we construct Balanced Complex Dense Synthetic Tables (BCDSTab), a rigorous benchmark comprising 900 complex table images synthesized with our method. Extensive experiments on multiple public datasets (FinTabNet, PubTabNet, MUSTARD) and BCDSTab demonstrate that InstructTable achieves state-of-the-art performance on TSR tasks. Ablation studies further confirm the positive impact of the proposed tabular-data-specific instructions and synthetic data.

Paper Structure

This paper contains 19 sections, 4 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Visualized comparison among traditional visual-centric TSR (VCTSR) models, vision-language models (VLM), and InstructTable. By leveraging instruction pre-training and TSR fine-tuning to jointly model visual information and instruction dependencies, InstructTable enhances fine-grained structural comprehension of tables.
  • Figure 2: Implicit row problem visualization – identical images paired with divergent ground truths cause input-output misalignment, misleading model training. In the matrix, green regions denote normal cells while red dotted patterns indicate implicit rows.
  • Figure 3: TME synthesis pipeline. Authentic table data undergoes matrix processing, partitioning, splicing, content generation, and validity checking to generate novel synthetic tables for scenario-agnostic data expansion. In the atomic cell matrix, "C" denotes independent cells, "L" denotes left-merged cells, "U" denotes up-merged cells, and "X" denotes bidirectionally merged (left and up) cells.
  • Figure 4: Main framework of InstructTable. The red dashed region indicates that text encoding occurs only during training, where instruction embeddings are cached. During inference, the cached vectors enable efficient processing without real-time text encoding.
  • Figure 5: Attention heatmaps of the cross-attention layer under four instruction groups. Visualizations reveal how task-specific instructions dynamically modulate attention focus across table regions during parsing.
  • ...and 3 more figures
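The atomic cell matrix described in the Figure 3 caption admits a straightforward construction: each anchor cell keeps "C", while the remaining grid positions covered by its row/column spans are marked according to the direction of the merge. The sketch below illustrates this encoding under my own assumptions about the input format (cells given as `(row, col, rowspan, colspan)` tuples); it is not the authors' implementation.

```python
def atomic_cell_matrix(cells, n_rows, n_cols):
    """Build an atomic cell matrix in the style of TME's Figure 3 (a sketch,
    not the paper's code). Codes: 'C' = independent/anchor cell,
    'L' = merged into the cell on its left, 'U' = merged into the cell above,
    'X' = merged both leftward and upward.

    `cells` is assumed to be a list of (row, col, rowspan, colspan) tuples
    giving each logical cell's anchor position and spans.
    """
    grid = [["C"] * n_cols for _ in range(n_rows)]
    for r, c, rowspan, colspan in cells:
        for dr in range(rowspan):
            for dc in range(colspan):
                if dr == 0 and dc == 0:
                    grid[r][c] = "C"            # anchor of the logical cell
                elif dr == 0:
                    grid[r][c + dc] = "L"       # same row: left-merged
                elif dc == 0:
                    grid[r + dr][c] = "U"       # same column: up-merged
                else:
                    grid[r + dr][c + dc] = "X"  # merged in both directions
    return grid


# A 2x3 table whose top-left cell spans 2 rows and 2 columns:
cells = [(0, 0, 2, 2), (0, 2, 1, 1), (1, 2, 1, 1)]
print(atomic_cell_matrix(cells, 2, 3))
# [['C', 'L', 'C'], ['U', 'X', 'C']]
```

Flattening span information into per-grid-position codes like this makes splicing operations local: two partitions can be concatenated and re-checked for validity by scanning the matrix rather than re-resolving spans.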