
Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis

Tianci Bi, Xiaoyi Zhang, Zhizheng Zhang, Wenxuan Xie, Cuiling Lan, Yan Lu, Nanning Zheng

TL;DR

Text Grouping Adapter (TGA) is a detector-agnostic module that equips pre-trained text detectors for scene text layout analysis. It combines Text Instance Feature Assembling (TIFA) and Group Mask Prediction (GMP) to produce text group representations and an affinity matrix, using one-to-many Hungarian matching and Dice-based supervision to learn cohesive group features. The approach supports either freezing or fine-tuning the detector and can be cascaded for word-to-line-to-paragraph grouping, achieving significant improvements over baselines on HierText. By reusing large-scale detection models and datasets, TGA offers a practical, scalable path to integrated text detection and layout understanding. The work also highlights the broader potential of adapters for modeling inter-instance relations in other vision tasks.

Abstract

Significant progress has been made in scene text detection models since the rise of deep learning, but scene text layout analysis, which aims to group detected text instances into paragraphs, has not kept pace. Previous works either treat text detection and grouping with separate models or train a unified model from scratch. None of them makes full use of already well-trained text detectors and easily obtainable detection datasets. In this paper, we present Text Grouping Adapter (TGA), a module that enables various pre-trained text detectors to learn layout analysis, allowing us to adopt a well-trained text detector off the shelf or fine-tune it efficiently. Designed to be compatible with various text detector architectures, TGA takes detected text regions and image features as universal inputs to assemble text instance features. To capture broader contextual information for layout analysis, we propose to predict text group masks from text instance features by one-to-many assignment. Our comprehensive experiments demonstrate that, even with frozen pre-trained models, incorporating TGA into various pre-trained text detectors and text spotters achieves superior layout analysis performance while inheriting the generalized text detection ability from pre-training. With full-parameter fine-tuning, layout analysis performance improves further.
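
The one-to-many assignment of group masks to text instances can be illustrated with a minimal NumPy sketch. It assumes that every instance in a group receives the same target, namely the union of the instance masks in that group, and that predictions are scored with the standard soft Dice loss. The function names `group_mask_targets` and `dice_loss` are hypothetical, and the paper's exact matching procedure and loss weighting may differ.

```python
import numpy as np


def group_mask_targets(instance_masks, group_ids):
    """Build one ground-truth group mask per instance (one-to-many assignment).

    instance_masks: (N, H, W) boolean array, one mask per text instance.
    group_ids: length-N integer array; instances sharing an id form one group.
    Returns an (N, H, W) boolean array whose row i is the union of all
    instance masks belonging to the group of instance i.
    """
    instance_masks = np.asarray(instance_masks, dtype=bool)
    group_ids = np.asarray(group_ids)
    targets = np.zeros_like(instance_masks)
    for gid in np.unique(group_ids):
        members = group_ids == gid
        # Union of the member masks; the same mask is duplicated for each member.
        targets[members] = instance_masks[members].any(axis=0)
    return targets


def dice_loss(pred, target, eps=1e-6):
    """Standard soft Dice loss per mask (assumed form, not taken from the paper).

    pred: (N, H, W) array of probabilities in [0, 1].
    target: (N, H, W) array of binary ground-truth masks.
    Returns a length-N array of losses in [0, 1].
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    pred = pred.reshape(pred.shape[0], -1)
    target = target.reshape(target.shape[0], -1)
    intersection = (pred * target).sum(axis=1)
    denominator = pred.sum(axis=1) + target.sum(axis=1)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


if __name__ == "__main__":
    # Three toy instances on a 1x4 image; the first two share a group.
    masks = np.array(
        [[[1, 0, 0, 0]], [[0, 1, 0, 0]], [[0, 0, 0, 1]]], dtype=bool
    )
    targets = group_mask_targets(masks, group_ids=[0, 0, 1])
    print(targets.astype(int))
    print(dice_loss(masks, targets))
```

In this toy case the first two instances share one target mask covering both of them, so a predictor that only reproduces its own instance mask is penalized, while the third instance, which is alone in its group, incurs no loss.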

Paper Structure (29 sections, 6 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Top: Visualization of the scene text detection and layout analysis tasks. Masks with the same color denote instances detected as one group. Bottom: Comparison between (a) the previous work, Unified Detector (long2022towards), and (b) the proposed TGA. TGA also provides the flexibility of freezing or fine-tuning the pre-trained text detector.
  • Figure 2: Overview of the proposed Text Grouping Adapter. The dashed boxes denote a matched group and instance. The text detector can be frozen or fine-tuned together with TGA during training. $\hat{\mathbf{M}}^{I}_{i}$ is the predicted instance mask of $I_i$. $\hat{\mathbf{M}}^{G}_{i}$ and $\mathbf{M}^{G}_{i}$ are the predicted group mask of $I_i$ and the ground-truth group mask assigned to $I_i$, respectively. To illustrate, the same group mask is duplicated as $\mathbf{M}^{G}_{p}$ and $\mathbf{M}^{G}_{q}$ and assigned to $I_p$ and $I_q$.
  • Figure 3: Details of Cascade TGA. "Intermediate image features" denotes the multi-scale features before summing in Word TGA.
  • Figure 4: Comparison between different stages of single TGA and Cascade TGA training under the frozen text detector strategy.
  • Figure 5: Visualization of results on the validation set of the HierText Dataset: from left to right, the sequence includes the ground truth, line-based Unified Detector, TGA + MaskDINO-Swin-B and TGA + DeepSolo-ViTAE-S. (Zoom in for the best view)