LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network

Yuchen Su, Zhineng Chen, Zhiwen Shao, Yuning Du, Zhilong Ji, Jinfeng Bai, Yong Zhou, Yu-Gang Jiang

TL;DR

LRANet introduces Low-Rank Approximation (LRA) to represent arbitrary-shaped text contours as a linear combination of eigenanchors learned from labeled contours, enabling compact and geometry-aware decoding. A dual assignment scheme combines dense supervision during training with sparse, fast inference to boost speed without sacrificing accuracy, implemented in a single-stage LRANet detector. Evaluations on CTW1500, Total-Text, and MSRA-TD500 show state-of-the-art performance with strong efficiency, validating the effectiveness of LRA for text-specific shape modeling and the practical benefit of the dual-assignment strategy. The approach offers a scalable, robust solution for real-time, arbitrary-shaped scene text detection and has potential for extension to text spotting.

Abstract

Recently, regression-based methods, which predict parameterized text shapes for text localization, have gained popularity in scene text detection. However, existing parameterized text shape methods still struggle to model arbitrary-shaped text because they ignore text-specific shape information. Moreover, the time cost of the entire pipeline has been largely overlooked, leading to suboptimal overall inference speed. To address these issues, we first propose a novel parameterized text shape method based on low-rank approximation. Unlike other shape representation methods that employ data-irrelevant parameterization, our approach utilizes singular value decomposition and reconstructs the text shape using a few eigenvectors learned from labeled text contours. By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation. Next, we propose a dual assignment scheme for speed acceleration. It adopts a sparse assignment branch to accelerate inference while providing ample supervisory signals for training through a dense assignment branch. Building upon these designs, we implement an accurate and efficient arbitrary-shaped text detector named LRANet. Extensive experiments on several challenging benchmarks demonstrate the superior accuracy and efficiency of LRANet compared to state-of-the-art methods. Code is available at: https://github.com/ychensu/LRANet.git
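
The low-rank approximation idea in the abstract (singular value decomposition over labeled contours, then reconstruction from a few eigenvectors) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The synthetic contours, the function names (make_synthetic_contours, learn_eigenanchors, encode, decode), the choice of 14 sampled points per contour, and the absence of any normalization step are all assumptions made for illustration.

```python
import numpy as np


def make_synthetic_contours(num_contours=200, num_points=14, seed=0):
    """Hypothetical stand-in for labeled text contours (NOT real data).

    Each contour is a bent ellipse sampled at `num_points` points and
    flattened to [x_1..x_n, y_1..y_n]. By construction every contour is a
    combination of three fixed vectors, so the data matrix has rank 3.
    """
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False)
    contours = []
    for _ in range(num_contours):
        a = rng.uniform(1.0, 3.0)      # half-width of the text region
        b = rng.uniform(0.2, 0.6)      # half-height of the text region
        bend = rng.uniform(-0.5, 0.5)  # curvature of the text line
        x = a * np.cos(t)
        y = b * np.sin(t) + bend * x**2 / a
        contours.append(np.concatenate([x, y]))
    # One contour per column: shape (2 * num_points, num_contours).
    return np.stack(contours, axis=1)


def learn_eigenanchors(contour_matrix, num_anchors):
    """Return the first `num_anchors` left singular vectors as columns."""
    u, _, _ = np.linalg.svd(contour_matrix, full_matrices=False)
    return u[:, :num_anchors]


def encode(eigenanchors, contour):
    """Project a flattened contour onto the eigenanchors (coefficients)."""
    return eigenanchors.T @ contour


def decode(eigenanchors, coeffs):
    """Reconstruct a flattened contour as a linear combination."""
    return eigenanchors @ coeffs


if __name__ == "__main__":
    data = make_synthetic_contours()
    target = data[:, 0]
    for m in (1, 2, 3):
        anchors = learn_eigenanchors(data, m)
        recon = decode(anchors, encode(anchors, target))
        print(m, float(np.linalg.norm(recon - target)))
```

Because the singular vectors are orthonormal, encoding is a single matrix-vector product and the reconstruction error cannot increase as more eigenanchors are used. In a detector along the lines the abstract describes, the network would regress the coefficients rather than compute them from a ground-truth contour.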

Paper Structure (24 sections, 10 equations, 10 figures, 10 tables)

Figures (10)

  • Figure 1: Comparison of several popular scene text detectors on the CTW1500 dataset. Our method achieves the best trade-off between accuracy and efficiency.
  • Figure 2: Comparison of different parameterized text shape methods. Ground-truth contours are depicted in green, and the fitted curves are shown in red. The numbers of fitting parameters are $44$, $32$, $22$, and $\mathbf{14}$ for (a) through (d), respectively. Our method yields a superior contour representation using fewer parameters.
  • Figure 3: Illustration of the low-rank approximation representation. The ground-truth contour is depicted in green, with $u_1$, $u_2$, ..., $u_8$ and $u_9$ as eigenanchors. The text contour is approximated by a linear combination of the eigenanchors, and fitted curves using different numbers of eigenanchors are shown from left to right.
  • Figure 4: Overview of LRANet, which mainly comprises three modules: (a) the backbone and FPN for feature extraction, (b) a shared head for joint optimization, and (c) an LRA decoder to reconstruct the text shape. The classification branch and the regression branch predict the positive samples and the LRA coefficients, respectively.
  • Figure 5: Visualization of the first six eigenanchors with different data representations. These eigenanchors are obtained from the CTW1500 training dataset.
  • ...and 5 more figures