TextMamba: Scene Text Detector with Mamba
Qiyan Zhao, Yue Yan, Da-Han Wang
TL;DR
TextMamba tackles the challenge of modeling long-range dependencies in scene text detection by integrating Mamba's SS2D selective scanning with deformable attention in a hybrid Mix-SSM encoder. It introduces the Mix-SSM block, a Dual-scale FFN, and an Embedding Pyramid Enhancement Module to enable sparse, cross-scale, token-level fusion and robust polygon regression. The approach achieves state-of-the-art or competitive results on CTW1500, TotalText, and ICDAR19-ArT, highlighting improved recall, precision, and overall F-measure while maintaining efficiency. These results suggest that combining selective long-range modeling with multi-scale token fusion can significantly enhance robustness to arbitrarily shaped text in real-world scenes, with potential for lightweight future variants.
Abstract
In scene text detection, Transformer-based methods have addressed the global feature extraction limitations inherent in traditional convolutional neural network-based methods. However, most rely directly on native Transformer attention layers as encoders without evaluating their cross-domain limitations and inherent shortcomings: forgetting important information or focusing on irrelevant representations when modeling long-range dependencies for text detection. The recently proposed state space model Mamba has demonstrated better long-range dependency modeling through a linear-complexity selection mechanism. We therefore propose a novel Mamba-based scene text detector that integrates the selection mechanism with attention layers, enhancing the encoder's ability to extract relevant information from long sequences. We adopt the Top_k algorithm to explicitly select key information and reduce the interference of irrelevant information in Mamba modeling. Additionally, we design a dual-scale feed-forward network and an embedding pyramid enhancement module to facilitate high-dimensional hidden state interactions and multi-scale feature fusion. Our method achieves state-of-the-art or competitive performance on various benchmarks, with F-measures of 89.7%, 89.2%, and 78.5% on CTW1500, TotalText, and ICDAR19-ArT, respectively. Code will be made available.
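The Top_k selection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how tokens are scored or where the selection sits in the encoder, so everything here is an assumption for illustration only: the function name select_top_k_tokens is hypothetical, the relevance scores are taken as given, and the kept tokens are returned in their original order so that a subsequent sequence model would still see them in scan order.

```python
from typing import List, Sequence, Tuple


def select_top_k_tokens(
    tokens: Sequence[str], scores: Sequence[float], k: int
) -> Tuple[List[int], List[str]]:
    """Keep the k highest-scoring tokens, preserving sequence order.

    Hypothetical helper: how scores are computed is not specified in the
    abstract, so they are passed in as plain numbers.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if k <= 0:
        raise ValueError("k must be positive")
    k = min(k, len(tokens))  # never ask for more tokens than exist

    # Rank positions by descending score; ties go to the earlier position.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))

    # Restore the original ordering of the selected positions.
    kept = sorted(ranked[:k])
    return kept, [tokens[i] for i in kept]


if __name__ == "__main__":
    toks = ["t0", "t1", "t2", "t3", "t4"]
    rel = [0.1, 0.9, 0.3, 0.8, 0.2]
    indices, selected = select_top_k_tokens(toks, rel, k=2)
    print(indices, selected)
```

With the scores above and k=2, the sketch keeps positions 1 and 3 and drops the three lower-scoring tokens, which is the sense in which an explicit selection step can reduce interference from irrelevant positions before long-sequence modeling.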
