FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting

Alloy Das, Sanket Biswas, Umapada Pal, Josep Lladós, Saumik Bhattacharya

TL;DR

FastTextSpotter tackles efficient multilingual scene text spotting by integrating a Swin-Tiny backbone with a Transformer encoder-decoder and a novel SAC2 self-attention module. Each text instance is represented as a set of control-point coordinates plus a character sequence, and dynamic reference-point sampling is used to accelerate training and inference. The model achieves competitive or superior results on ICDAR2015, Total-Text, CTW1500, and VinText, while delivering higher FPS than many state-of-the-art methods. The work demonstrates the practicality of transformer-based architectures for real-time, multilingual OCR and provides datasets, code, and pretrained models on GitHub.

Abstract

The proliferation of scene text in both structured and unstructured environments presents significant challenges in optical character recognition (OCR), necessitating more efficient and robust text spotting solutions. This paper presents FastTextSpotter, a framework that integrates a Swin Transformer visual backbone with a Transformer Encoder-Decoder architecture, enhanced by a novel, faster self-attention unit, SAC2, to improve processing speed while maintaining accuracy. FastTextSpotter has been validated across multiple datasets, including ICDAR2015 for regular texts and CTW1500 and Total-Text for arbitrary-shaped texts, benchmarking against current state-of-the-art models. Our results indicate that FastTextSpotter not only achieves superior accuracy in detecting and recognizing multilingual scene text (English and Vietnamese) but also improves model efficiency, thereby setting new benchmarks in the field. This study underscores the potential of advanced transformer architectures in improving the adaptability and speed of text spotting applications in diverse real-world settings. The dataset, code, and pre-trained models have been released on GitHub.

Paper Structure (12 sections, 8 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Trade-off between text spotting performance (h-mean) and the number of training iterations. The blue curve shows the model without the SAC2 attention module; the orange curve shows the model with our proposed SAC2 module.
  • Figure 2: Overview of FastTextSpotter illustrating a Swin Transformer visual backbone with a Transformer Encoder-Decoder framework. Key features include the SAC2 attention module, dual decoders for accurate text localization and recognition, and the Reference Point Sampling system for effective text detection across various shapes and languages.
  • Figure 3: Visualization of attention maps for the ResNet-50 feature backbone. From left to right: attention maps starting from the first layer.
  • Figure 4: Visualization of attention maps for the Swin-Tiny feature backbone. From left to right: attention maps starting from the first layer.
  • Figure 5: Qualitative results of our method on different datasets (zoom in for detail). The first two images are from Total-Text, the third and fourth from CTW1500, the fifth and sixth from ICDAR2015, and the last two from VinText.