LRANet++: Low-Rank Approximation Network for Accurate and Efficient Text Spotting

Yuchen Su, Zhineng Chen, Yongkun Du, Zuxuan Wu, Hongtao Xie, Yu-Gang Jiang

TL;DR

LRANet++ tackles the challenge of accurate and efficient end-to-end text spotting for arbitrary-shaped text by introducing a data-driven low-rank approximation (LRA) to model text contours and a triple assignment detection head that decouples learning from inference. The LRA uses a robust Fast Median Subspace to derive orthonormal orthanchors, enabling compact, stable contour representations; the triple assignment head combines a deep sparse teacher, a dense auxiliary, and a shallow sparse student to preserve accuracy while accelerating inference. A Transformer-based, TPS-aligned recognition head enables efficient end-to-end transcription via CTC decoding, aided by large-ratio image scaling to mitigate RoI-induced distortions. Extensive experiments across CTW1500, Total-Text, and multilingual benchmarks demonstrate state-of-the-art end-to-end F-measures and real-time speeds, validating the approach’s practical impact for robust, scalable text understanding in natural scenes.

Abstract

End-to-end text spotting aims to jointly optimize text detection and recognition within a unified framework. Despite significant progress, designing an accurate and efficient end-to-end text spotter for arbitrary-shaped text remains largely unsolved. We identify the primary bottleneck as the lack of a reliable and efficient text detection method. To address this, we propose a novel parameterized text shape method based on low-rank approximation for precise detection and a triple assignment detection head to enable fast inference. Specifically, unlike other shape representation methods that employ data-irrelevant parameterization, our data-driven approach derives a low-rank subspace directly from labeled text boundaries. To ensure this process is robust against the inherent annotation noise in this data, we utilize a specialized recovery method based on an $\ell_1$-norm formulation, which accurately reconstructs the text shape with only a few key orthogonal vectors. By exploiting the inherent shape correlation among different text contours, our method achieves consistency and compactness in shape representation. Next, the triple assignment scheme introduces a novel architecture where a deep sparse branch (for stabilized training) is used to guide the learning of an ultra-lightweight sparse branch (for accelerated inference), while a dense branch provides rich parallel supervision. Building upon these advancements, we integrate the enhanced detection module with a lightweight recognition branch to form an end-to-end text spotting framework, termed LRANet++, capable of accurately and efficiently spotting arbitrary-shaped text. Extensive experiments on several challenging benchmarks demonstrate the superiority of LRANet++ compared to state-of-the-art methods. Code will be available at: https://github.com/ychensu/LRANet-PP.git
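The idea of representing a contour with a few orthogonal vectors learned from labeled boundaries can be illustrated with a minimal sketch. It rests on several assumptions that are not from the paper: the contours are synthetic ellipses, each sampled at 14 points; the subspace is obtained with a plain SVD, which is an $\ell_2$ substitute for the robust $\ell_1$-norm recovery the abstract describes; and the helper names (`sample_contours`, `fit_basis`, `encode`, `decode`) are hypothetical.

```python
import numpy as np


def sample_contours(num_contours, num_points, rng):
    """Synthetic closed contours (ellipses), flattened as (x_1..x_N, y_1..y_N)."""
    t = np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False)
    contours = []
    for _ in range(num_contours):
        a, b = rng.uniform(0.5, 2.0, size=2)      # random semi-axes
        phase = rng.uniform(0.0, 2.0 * np.pi)     # random starting point
        x = a * np.cos(t + phase)
        y = b * np.sin(t + phase)
        contours.append(np.concatenate([x, y]))
    return np.stack(contours)                     # shape: (num_contours, 2 * num_points)


def fit_basis(contours, rank):
    """Return `rank` orthonormal row vectors spanning the dominant subspace.

    Assumption: plain SVD stands in for the robust subspace recovery
    used in the paper; no mean-centering is applied.
    """
    _, _, vt = np.linalg.svd(contours, full_matrices=False)
    return vt[:rank]                              # shape: (rank, 2 * num_points)


def encode(contour, basis):
    """Project a flattened contour onto the basis to get its coefficients."""
    return basis @ contour


def decode(coeffs, basis):
    """Reconstruct a flattened contour as a linear combination of basis vectors."""
    return coeffs @ basis


rng = np.random.default_rng(0)
train = sample_contours(200, 14, rng)
basis = fit_basis(train, rank=4)

held_out = sample_contours(1, 14, rng)[0]
coeffs = encode(held_out, basis)
reconstructed = decode(coeffs, basis)
print("coefficients per contour:", coeffs.shape[0])
print("max reconstruction error:", np.abs(reconstructed - held_out).max())
```

In this toy setting a 28-dimensional contour is described by four coefficients, so a detection head would only need to regress those coefficients. Real text contours are noisier and more varied, which is where a robust subspace estimate and a somewhat larger rank become relevant.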

Paper Structure

This paper contains 37 sections, 11 equations, 9 figures, 20 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison between our LRANet++ and several popular scene text spotters on the CTW1500 dataset. LRANet++ achieves the leading F-measure while running much faster.
  • Figure 2: Examples of F-measure variation under different IoU constraints. Inaccurate detection results (e.g., IoU with GT below 0.5) rarely lead to accurate recognition, and fully capitalizing on well-localized text regions requires a well-designed overall spotting architecture.
  • Figure 3: Illustration of the low-rank approximation representation. The GT contour is depicted in green, with $u_1$, $u_2$, ..., $u_6$ as the $orthanchors$. The text contour is approximated by a linear combination of the $orthanchors$. As shown, only six $orthanchors$ are needed to fit the curved text.
  • Figure 4: The overview of our LRANet++. It is mainly composed of four modules: (a) backbone and FPN for feature extraction, (b) triple assignment head for predicting LRA coefficients, (c) LRA decoder to reconstruct text shape, and (d) lightweight recognition head that transcribes internal features of the text instance after TPS alignment into text sequences.
  • Figure 5: The structural details of the recognition head. It comprises a four-stage network with progressively decreasing height, and recognition is ultimately performed through a linear prediction layer.
  • ...and 4 more figures
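
The recognition head described in Figure 5 ends in a linear prediction layer, and the TL;DR states that transcription uses CTC decoding. A minimal sketch of one common way to decode such per-step outputs is shown here. It assumes greedy (best-path) decoding, a blank class at index 0, and a hypothetical function name `ctc_greedy_decode`; the source does not say which decoding variant or class layout is used.

```python
import numpy as np


def ctc_greedy_decode(logits, charset, blank=0):
    """Greedy CTC decoding: take the best class per step, merge repeats, drop blanks.

    logits:  array of shape (time_steps, 1 + len(charset)); index `blank` is the
             blank class (assumed here to be 0), class k maps to charset[k - 1].
    """
    best_path = np.argmax(logits, axis=1)
    decoded = []
    previous = blank
    for idx in best_path:
        # Emit a character only when it differs from the previous step
        # and is not the blank symbol.
        if idx != blank and idx != previous:
            decoded.append(charset[idx - 1])
        previous = idx
    return "".join(decoded)


# Toy example: a hand-written best path turned into one-hot "logits".
charset = "ab"
path = [1, 1, 0, 1, 2, 2, 0]                 # a a _ a b b _
logits = np.eye(len(charset) + 1)[path]
print(ctc_greedy_decode(logits, charset))
```

The blank between the two runs of `a` is what lets the decoder keep a genuinely repeated character while still collapsing consecutive duplicates.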