
Diving into the Depths of Spotting Text in Multi-Domain Noisy Scenes

Alloy Das, Sanket Biswas, Umapada Pal, Josep Lladós

TL;DR

DA-TextSpotter tackles text spotting in multi-domain noisy scenes by learning domain-agnostic representations through cross-domain pretraining on natural and underwater data, aided by a Real-ESRGAN enhancement module and a compact Swin backbone. It unifies detection and recognition in a set-prediction framework with polygon-based text localization, using a dual-decoder transformer to predict $X = \{(S^{(i)}, R^{(i)})\}_{i=1}^K$. The authors introduce the Under-Water Text (UWT) benchmark and show state-of-the-art performance across natural and underwater datasets, with substantial gains from the enhancement unit and domain-generalization strategies. These results demonstrate the practicality of domain-agnostic training for deployment in challenging environments (e.g., underwater robotics) and point to future work in domain-incremental learning and faster attention modules.
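As a rough illustration of the set-prediction output $X = \{(S^{(i)}, R^{(i)})\}_{i=1}^K$, the sketch below models each of the $K$ candidates as a polygon paired with a transcription. Reading $S^{(i)}$ as the polygon and $R^{(i)}$ as the recognized string is an assumption based on the summary's mention of polygon-based localization. The `score` field, the `select_instances` helper, and the 0.5 threshold are hypothetical names introduced for illustration, not part of DA-TextSpotter.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TextInstance:
    """One element of the predicted set (illustrative only)."""
    polygon: List[Point]  # assumed reading of S^(i): polygon vertices in pixels
    text: str             # assumed reading of R^(i): recognized transcription
    score: float          # hypothetical confidence, not taken from the paper


def select_instances(candidates: List[TextInstance],
                     threshold: float = 0.5) -> List[TextInstance]:
    """Keep the candidates whose confidence is at least `threshold`."""
    return [c for c in candidates if c.score >= threshold]


def polygon_area(polygon: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


# A toy set with K = 3 candidates; the values are made up.
candidates = [
    TextInstance([(0, 0), (4, 0), (4, 2), (0, 2)], "DIVE", 0.92),
    TextInstance([(5, 5), (9, 5), (9, 6), (5, 6)], "SHOP", 0.71),
    TextInstance([(1, 1), (2, 1), (2, 2), (1, 2)], "", 0.08),
]
kept = select_instances(candidates)
```

Representing the output as a flat set of (polygon, text) pairs means that no separate step is needed to match a detected region to its transcription. Each kept element already carries both.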

Abstract

When deployed in real-world noisy environments, an autonomous scene text spotting system must be able to generalize to multiple domains. However, existing state-of-the-art methods rely on pretraining and fine-tuning on natural scene datasets, which does not exploit feature interactions across other complex domains. In this work, we investigate the problem of domain-agnostic scene text spotting, i.e., training a model on multi-domain source data so that it generalizes directly to target domains rather than being specialized for a specific domain or scenario. To this end, we present to the community a text spotting validation benchmark called Under-Water Text (UWT) for noisy underwater scenes, establishing an important case study. We also design an efficient super-resolution-based end-to-end transformer baseline called DA-TextSpotter, which achieves comparable or superior performance to existing text spotting architectures on both regular and arbitrary-shaped scene text spotting benchmarks in terms of both accuracy and model efficiency. The dataset, code and pre-trained models will be released upon acceptance.

Paper Structure (11 sections, 3 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Qualitative comparison with SOTA approaches: a sample test image from the Total-Text dataset and the end-to-end text spotting performance of DA-TextSpotter compared with other methods.
  • Figure 2: The overall framework of DA-TextSpotter consisting of Super-Resolution, Feature Extraction and Text Spotting units sequentially.
  • Figure 3: Illustration of the effectiveness of the Super-Resolution unit for noise removal on underwater (first case) and natural (second case) scenes.
  • Figure 4: Illustration of the advantage of Swin-Transformer over ResNet-50 as a backbone, which generates better-localized representations.
  • Figure 5: Illustration of our method on different datasets. 1st column from Total-Text, 2nd column from CTW1500, 3rd column from ICDAR15, 4th and 5th columns from UWT. [Use 200% zoom for better visualization of the qualitative results]