Bridging the Gap Between End-to-End and Two-Step Text Spotting

Mingxin Huang; Hongliang Li; Yuliang Liu; Xiang Bai; Lianwen Jin

TL;DR

Bridging Text Spotting addresses the gap between modular two-step and end-to-end text spotting by freezing independently trained detectors and recognizers and connecting them with a zero-initialized Bridge, aided by an Adapter for end-to-end learning. The Bridge merges large-receptive-field detection features with high-resolution recognition inputs, enabling end-to-end optimization without retraining the fixed modules. The approach achieves strong results on Total-Text (83.3%), CTW1500 (69.8%), and ICDAR 2015 (89.5%), with an average improvement of 4.4% across detector-recognizer pairings. This work offers a practical path to leverage mature components for end-to-end text spotting with reduced data and training time.

Abstract

Modularity plays a crucial role in the development and maintenance of complex systems. While end-to-end text spotting mitigates the error accumulation and suboptimal performance seen in traditional two-step methods, two-step methods continue to be favored in many competitions and practical settings because of their superior modularity. In this paper, we introduce Bridging Text Spotting, a novel approach that resolves the error accumulation and suboptimal performance of two-step methods while retaining their modularity. To achieve this, we adopt a well-trained detector and recognizer that were developed and trained independently, then lock their parameters to preserve the capabilities they have already acquired. We then introduce a Bridge that connects the locked detector and recognizer through a zero-initialized neural network. Because its weights start at zero, this network integrates the large-receptive-field detection features into the locked recognizer seamlessly. Furthermore, since the fixed detector and recognizer cannot naturally acquire end-to-end optimization features, we adopt an Adapter so that they can learn these features efficiently. We demonstrate the effectiveness of the proposed method through extensive experiments: connecting the latest detector and recognizer through Bridging Text Spotting, we achieve an accuracy of 83.3% on Total-Text, 69.8% on CTW1500, and 89.5% on ICDAR 2015. The code is available at https://github.com/mxin262/Bridging-Text-Spotting.
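The effect of zero initialization described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `ZeroInitBridge`, the feature dimensions, and the choice of a single linear projection added residually to the recognizer input are all assumptions made for illustration. The sketch only shows why a zero-initialized connection leaves the locked recognizer's input unchanged at the start of training, and how detector features begin to flow in once the weights move away from zero.

```python
import random


def linear(weights, bias, x):
    """Apply y = W x + b to a single feature vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class ZeroInitBridge:
    """Hypothetical bridge: projects a detector feature and adds it to the
    recognizer input. All weights and biases start at zero (assumed form)."""

    def __init__(self, det_dim, rec_dim):
        self.weights = [[0.0] * det_dim for _ in range(rec_dim)]
        self.bias = [0.0] * rec_dim

    def __call__(self, rec_feat, det_feat):
        injected = linear(self.weights, self.bias, det_feat)
        # Residual addition: the recognizer input plus the projected detector feature.
        return [r + i for r, i in zip(rec_feat, injected)]


rng = random.Random(0)
det_feat = [rng.uniform(-1, 1) for _ in range(8)]  # stand-in detector feature
rec_feat = [rng.uniform(-1, 1) for _ in range(4)]  # stand-in recognizer input

bridge = ZeroInitBridge(det_dim=8, rec_dim=4)

# At initialization the bridge contributes nothing, so the locked recognizer
# sees exactly the input it was trained on.
fused_at_init = bridge(rec_feat, det_feat)

# Simulate one weight moving away from zero, as a training update would do.
bridge.weights[0][0] = 0.5
fused_after_update = bridge(rec_feat, det_feat)
```

In this simplified form, `fused_at_init` equals `rec_feat`, while `fused_after_update` differs only in the channel whose weight was changed. For a single linear layer like this one, the gradient with respect to the weights depends on the detector feature rather than on the weights themselves, so training can still proceed from an all-zero start.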

Paper Structure (19 sections, 10 equations, 6 figures, 7 tables)

This paper contains 19 sections, 10 equations, 6 figures, 7 tables.

Introduction
Related Work
Methodology
Overall Architecture
Bridge
Adapter
Optimization
Experiments
Implementation Details
Comparison with State-of-the-art Methods
Ablation Studies
Ablation Study of the Bridge
Ablation Study of the Adapter
Ablation Study of the Zero-initialized Weight in Bridge
Ablation Study of the Number of Transformer Layers
...and 4 more sections

Figures (6)

Figure 1: Comparison of the proposed paradigm with existing text spotting paradigms. Our pipeline achieves better performance with high modularity. We use the latest detector zhang2023arbitrary and text spotter ye2023deepsolo to measure the training time of the two-step and end-to-end methods, respectively. Training time is measured on an RTX 3090. Det1. denotes the original detector, Det2. the new detector, Rec. the text recognizer, and Bri. the proposed Bridge.
Figure 2: The overall architecture of Bridging Text Spotting. Rec. denotes recognition. Crop denotes the crop operation, in which the detector's predictions are used to crop the text regions.
Figure 3: Illustration of the Adapter. "Neural network" refers to a fundamental building block of a neural network, such as a multi-head attention block or a transformer block. $\mathbf{W_1}$ and $\mathbf{W_2}$ denote linear layers. Act denotes the activation function. All normalization layers in the recognizer are tuned.
Figure 4: Ablative study of the zero-initialized weight in Bridge. “F” indicates F-measure in end-to-end text spotting results on Total-Text.
Figure 5: Effectiveness of Bridge. Red boxes indicate recognition errors due to inaccurate detection results. Zoom in for best view.
...and 1 more figures
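
The Adapter in Figure 3 is described only as two linear layers, $\mathbf{W_1}$ and $\mathbf{W_2}$, with an activation between them. A minimal sketch of one common way to arrange such a module is given here. The bottleneck shape (4 to 2 to 4), the ReLU activation, and the residual connection are assumptions for illustration and are not confirmed by the caption; the function names are hypothetical.

```python
def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(v):
    """Element-wise ReLU, standing in for the unspecified activation 'Act'."""
    return [max(0.0, a) for a in v]


def adapter(x, w1, w2):
    """Assumed adapter form: x + W2 * Act(W1 * x)."""
    hidden = relu(matvec(w1, x))   # W1 projects to a smaller dimension
    update = matvec(w2, hidden)    # W2 projects back to the input dimension
    return [a + u for a, u in zip(x, update)]


# Stand-in for the output of a locked block (e.g. an attention block).
x = [1.0, -2.0, 0.5, 3.0]

# Illustrative trainable weights: W1 maps 4 -> 2, W2 maps 2 -> 4.
w1 = [[0.5, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, -1.0]]
w2 = [[1.0, 0.0],
      [0.0, 1.0],
      [0.0, 0.0],
      [2.0, 0.0]]

out = adapter(x, w1, w2)
```

Under these assumptions only `w1` and `w2` would be trained, while the block producing `x` keeps its locked parameters; the residual path means the locked block's output passes through unchanged wherever the adapter's update is zero.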
