TextMamba: Scene Text Detector with Mamba

Qiyan Zhao, Yue Yan, Da-Han Wang

TL;DR

TextMamba tackles the challenge of modeling long-range dependencies in scene text detection by integrating Mamba's SS2D selective scanning with deformable attention in a hybrid Mix-SSM encoder. It introduces the Mix-SSM block, a Dual-scale FFN, and an Embedding Pyramid Enhancement Module to enable sparse, cross-scale, token-level fusion and robust polygon regression. The approach achieves state-of-the-art or competitive results on CTW1500, TotalText, and ICDAR19-ArT, highlighting improved recall, precision, and overall F-measure while maintaining efficiency. These results suggest that combining selective long-range modeling with multi-scale token fusion can significantly enhance robustness to arbitrarily shaped text in real-world scenes, with potential for lightweight future variants.

Abstract

In scene text detection, Transformer-based methods have addressed the global feature extraction limitations inherent in traditional convolutional neural network-based methods. However, most of them rely directly on native Transformer attention layers as encoders without evaluating their cross-domain limitations and inherent shortcomings: they can forget important information or focus on irrelevant representations when modeling long-range dependencies for text detection. The recently proposed state space model Mamba has demonstrated better long-range dependency modeling through a linear-complexity selection mechanism. Therefore, we propose a novel scene text detector based on Mamba that integrates the selection mechanism with attention layers, enhancing the encoder's ability to extract relevant information from long sequences. We adopt the Top-k algorithm to explicitly select key information and reduce the interference of irrelevant information in Mamba modeling. Additionally, we design a dual-scale feed-forward network and an embedding pyramid enhancement module to facilitate high-dimensional hidden state interactions and multi-scale feature fusion. Our method achieves state-of-the-art or competitive performance on various benchmarks, with F-measures of 89.7%, 89.2%, and 78.5% on CTW1500, TotalText, and ICDAR19-ArT, respectively. Code will be made available.
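The abstract does not spell out how the Top-k step scores or selects tokens, so the following is only a minimal sketch of generic top-k token selection, not the authors' implementation. The function name `top_k_select`, the per-token `scores`, and the toy values are all hypothetical; the sketch assumes each token already has a scalar relevance score and simply keeps the `k` highest-scoring tokens in their original sequence order.

```python
from typing import List, Sequence, Tuple


def top_k_select(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    k: int,
) -> Tuple[List[int], List[Sequence[float]]]:
    """Keep the k highest-scoring tokens, preserving their original order.

    `tokens` and `scores` are hypothetical inputs: one feature vector and
    one scalar relevance score per token.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Clamp k to the valid range [0, number of tokens].
    k = max(0, min(k, len(tokens)))
    # Rank token indices by descending score; ties go to the earlier index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore sequence order so later sequential modeling sees tokens in place.
    kept = sorted(ranked[:k])
    return kept, [tokens[i] for i in kept]


# Toy example with made-up values: four 2-D tokens and their scores.
tokens = [[0.1, 0.2], [0.9, 0.8], [0.3, 0.1], [0.7, 0.6]]
scores = [0.05, 0.90, 0.10, 0.75]
kept, selected = top_k_select(tokens, scores, k=2)
print(kept)      # [1, 3]
print(selected)  # [[0.9, 0.8], [0.7, 0.6]]
```

Sorting the kept indices back into sequence order is a design choice of this sketch: it discards low-scoring tokens while leaving the relative positions of the survivors intact, which matters when a scan-based model consumes them afterwards.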

Paper Structure

This paper contains 23 sections, 5 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Comparison of Self-Attention [8] and Mamba's SS2D [15]. The red box indicates the query patch, and the patch transparency indicates the corresponding degree of information loss. $x$ and $y$ in the S6 block are the input and output variables, respectively, and $\bar{A}$, $\bar{B}$, $\bar{C}$ are learnable parameters.
  • Figure 2: The overall framework of our method.
  • Figure 3: Qualitative results of our method on TotalText.
  • Figure 4: Qualitative results of our method on CTW1500.
  • Figure 5: Visualization of detection and attention results.
  • ...and 1 more figure