SVIPTR: Fast and Efficient Scene Text Recognition with Vision Permutable Extractor

Xianfu Cheng; Weixiao Zhou; Xiang Li; Jian Yang; Hang Zhang; Tao Sun; Wei Zhang; Yuying Mai; Tongliang Li; Xiaoming Chen; Zhoujun Li

TL;DR

This work tackles Scene Text Recognition (STR) with a focus on inference efficiency. It introduces SVIPTR, a Vision Permutable Extractor that uses a pyramid architecture and a mix of local and global self-attention to fuse visual and semantic cues for robust, length-insensitive STR. Key contributions include a four-stage architecture with height dimension reduction, patch embedding, attention-permutation strategies, and a CTC decoder, plus extensive ablation and attention visualizations. On English and Chinese benchmarks, SVIPTR variants achieve competitive or state-of-the-art accuracy while delivering significantly faster inference: the Tiny variant reaches very low latency (~3.3 ms per image on an NVIDIA V100), and the Large variant offers strong accuracy, making the approach practical for real-world deployment.
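The CTC decoder mentioned above turns per-frame predictions into a string without an autoregressive loop, which is one reason CTC-based recognizers tend to be fast. The snippet below is a minimal sketch of standard greedy CTC decoding, not the authors' implementation; the function name `ctc_greedy_decode`, the toy `charset`, and the choice of index 0 as the blank label are all assumptions made for illustration.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, blank=0):
    """Greedy CTC decoding: merge consecutive repeats, then drop blanks.

    frame_ids: per-frame argmax label indices (one per feature column).
    blank: index reserved for the CTC blank symbol (assumed to be 0 here).
    """
    # Step 1: collapse runs of identical labels into a single label.
    collapsed = [label for label, _ in groupby(frame_ids)]
    # Step 2: remove the blank symbol that separates repeated characters.
    return [label for label in collapsed if label != blank]


# Hypothetical character set: index 0 is the blank, 1..26 are 'a'..'z'.
charset = "-abcdefghijklmnopqrstuvwxyz"

# Toy per-frame predictions; the blank between the two 'l' runs keeps
# the double letter from being merged into one.
frames = [0, 8, 8, 0, 5, 12, 12, 0, 12, 15, 15]
text = "".join(charset[i] for i in ctc_greedy_decode(frames))
print(text)
```

Because decoding is a single pass over the frame sequence, its cost grows linearly with the feature-map width rather than with the number of decoding steps of a sequence decoder.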

Abstract

Scene Text Recognition (STR) is an important and challenging upstream task for building structured information databases, which involves recognizing text within images of natural scenes. Although current state-of-the-art (SOTA) models for STR exhibit high performance, they typically suffer from low inference efficiency due to their reliance on hybrid architectures composed of visual encoders and sequence decoders. In this work, we propose a VIsion Permutable extractor for fast and efficient Scene Text Recognition (SVIPTR), which achieves an impressive balance between high performance and rapid inference speed in the domain of STR. Specifically, SVIPTR leverages a visual-semantic extractor with a pyramid structure, characterized by the permutation and combination of local and global self-attention layers. This design results in a lightweight and efficient model whose inference is insensitive to input length. Extensive experimental results on various standard datasets for both Chinese and English scene text recognition validate the superiority of SVIPTR. Notably, the SVIPTR-T (Tiny) variant delivers highly competitive accuracy on par with other lightweight models and achieves SOTA inference speed. Meanwhile, the SVIPTR-L (Large) variant attains SOTA accuracy among single-encoder-type models while maintaining a low parameter count and favorable inference speed. Our proposed method provides a compelling solution for the STR challenge, which greatly benefits real-world applications requiring fast and efficient STR. The code is publicly available at https://github.com/cxfyxl/VIPTR.
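To make the distinction between local and global self-attention concrete, the sketch below builds boolean attention masks over a 1-D token sequence and counts how many query-key pairs each one allows. It is only an illustration under simplifying assumptions: the helper names (`local_mask`, `global_mask`, `attended_pairs`) and the sliding-window form of locality are hypothetical, and the paper's actual attention variants may differ.

```python
def local_mask(n, window):
    """mask[i][j] is True when token j is within `window` positions of token i."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]


def global_mask(n):
    """Every token may attend to every other token."""
    return [[True] * n for _ in range(n)]


def attended_pairs(mask):
    """Number of (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)


n_tokens = 6
local_pairs = attended_pairs(local_mask(n_tokens, window=1))
global_pairs = attended_pairs(global_mask(n_tokens))
print(local_pairs, global_pairs)
```

With a fixed window, the number of permitted pairs grows linearly in the sequence length, whereas the global mask grows quadratically; mixing the two kinds of layers trades fine-grained character cues against string-level context.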

Paper Structure (19 sections, 4 figures, 6 tables)

Introduction
Related Work
CNN-Based Vision Modules.
Transformer-Based Vision Modules.
Sparse Self-attention and Spatial Modeling.
Method
Design Guidelines
Overall Architecture
Patch Embedding
Character and String Modeling
Height Dimension Reduction
Prediction
Vision Permutation and Architecture Variants
Experiments and Discussion
Experiment Settings
...and 4 more sections

Figures (4)

Figure 1: Model Architecture Evolution for Scene Text Recognition.
Figure 2: Overall architecture of the proposed SVIPTR. It is a four-stage network in which the feature-map height is progressively reduced over three stages. In each stage, a series of attention-mixing blocks is applied, followed by a subsampling or combining operation. In particular, SVIPTR designs two visual feature fusion modes in vision permutation blocks: series and parallel. Finally, recognition is performed by the CTC decoder.
Figure 3: Illustration of (a) Cross-Shaped Windows Self-Attention (CSWin), (b) Decomposed Manhattan Self-Attention (D-MaSA), (c) Multi-Head Self-Attention (MHSA), and (d) Overlapping Spatial Reduction Attention (OSRA).
Figure 4: Visualization of the attention maps for SVIPTRv2-T. Characters shown in red are missed or misrecognized.
