Diving into the Depths of Spotting Text in Multi-Domain Noisy Scenes
Alloy Das, Sanket Biswas, Umapada Pal, Josep Lladós
TL;DR
DA-TextSpotter tackles text spotting in noisy, multi-domain scenes by learning domain-agnostic representations through cross-domain pretraining on natural and underwater data, aided by a Real-ESRGAN enhancement module and a compact Swin backbone. It unifies detection and recognition in a set-prediction framework with polygon-based text localization: a dual-decoder transformer predicts a set of $K$ text instances, $X = \{(S^{(i)}, R^{(i)})\}_{i=1}^K$. The authors introduce the Under-Water Text (UWT) benchmark and report state-of-the-art performance across natural and underwater datasets, with substantial gains attributed to the enhancement unit and the domain-generalization strategy. These results suggest that domain-agnostic training is practical for deployment in challenging environments (e.g., underwater robotics) and point to future work on domain-incremental learning and faster attention modules.
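To make the set-prediction output concrete, the following is a minimal sketch of how a set $X$ of $K$ pairs could be represented and post-processed. It is not the authors' implementation. It assumes that $S^{(i)}$ is a list of polygon vertices and $R^{(i)}$ is the recognized string, which the summary above does not state explicitly. The names `TextInstance`, `filter_predictions`, and `polygon_area`, the `score` field, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TextInstance:
    """One element (S, R) of the predicted set X (hypothetical container)."""
    polygon: List[Point]  # assumed meaning of S: polygon vertices around the text
    text: str             # assumed meaning of R: recognized character string
    score: float = 1.0    # hypothetical confidence, not mentioned in the summary


def filter_predictions(preds: List[TextInstance],
                       threshold: float = 0.5) -> List[TextInstance]:
    """Keep only instances whose confidence reaches the (assumed) threshold."""
    return [p for p in preds if p.score >= threshold]


def polygon_area(polygon: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# A toy predicted set with K = 2 instances (made-up values for illustration).
predictions = [
    TextInstance([(0, 0), (4, 0), (4, 2), (0, 2)], "DIVE", 0.9),
    TextInstance([(5, 5), (6, 5), (6, 6), (5, 6)], "", 0.1),
]
kept = filter_predictions(predictions)
```

Because the prediction is an unordered set, any ordering of `predictions` describes the same output; only the pairing of each polygon with its transcription matters.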
Abstract
For any autonomous scene text spotting system deployed in noisy real-world environments, the capacity to generalize to multiple domains is essential. However, existing state-of-the-art methods rely on pretraining and fine-tuning on natural scene datasets, which does not exploit feature interaction across other complex domains. In this work, we investigate the problem of domain-agnostic scene text spotting, i.e., training a model on multi-domain source data so that it can generalize directly to target domains rather than being specialized for a specific domain or scenario. To this end, we present to the community a text spotting validation benchmark for noisy underwater scenes, called Under-Water Text (UWT), as an important case study. We also design an efficient, super-resolution-based, end-to-end transformer baseline called DA-TextSpotter, which achieves comparable or superior performance relative to existing text spotting architectures on both regular and arbitrary-shaped scene text spotting benchmarks, in terms of both accuracy and model efficiency. The dataset, code and pre-trained models will be released upon acceptance.
