Efficiently Leveraging Linguistic Priors for Scene Text Spotting
Nguyen Nguyen, Yapeng Tian, Chenliang Xu
TL;DR
This work addresses the underuse of linguistic priors in scene text spotting. It replaces one-hot labels with soft distributions derived from large language models, so that word-level character relations guide both detection and recognition without additional training or inference cost. The method has two steps: centroid generation from CANINE-based character embeddings, and soft distribution generation, where per-character distributions $D_i$ are computed as $D_i = \frac{\exp(W_i^T x_j)}{\sum_k \exp(W_k^T x_j)}$ and optimized against language-informed targets with the KL divergence $L = \sum_i D^{(i)} \log\frac{D^{(i)}}{P^{(i)}}$ under a noise threshold $T = 0.85$. The method yields consistent, significant improvements in scene text spotting on Total-Text, ICDAR15, and SCUT-CTW1500, and achieves state-of-the-art results on several benchmarks, even when using dictionaries or external data (ED). It also improves scene text recognition by leveraging TargetDict and cross-domain linguistic priors, reaching competitive or superior performance without increasing model complexity. It is validated on ABCNetv2, Mask TextSpotterv3, and CornerTransformer.
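The two formulas in the summary above can be illustrated with a minimal sketch in plain Python. It rests on several assumptions that the summary does not spell out: $W_i$ is treated as the centroid of character class $i$, $x_j$ as the embedding of one character in a word, $D$ as the soft target, and $P$ as the distribution predicted by the spotter. The function names and the toy values are hypothetical, and the noise threshold $T$ is left out because its exact use is not described here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_distribution(centroids, x):
    """D_i = exp(W_i^T x) / sum_k exp(W_k^T x).

    Assumption: each row of `centroids` is one class centroid W_i and
    `x` is the embedding of a single character.
    """
    scores = [sum(w * v for w, v in zip(centroid, x)) for centroid in centroids]
    return softmax(scores)


def kl_divergence(target, predicted, eps=1e-12):
    """L = sum_i D_i * log(D_i / P_i), skipping classes with zero target mass."""
    return sum(
        d * math.log((d + eps) / (p + eps))
        for d, p in zip(target, predicted)
        if d > 0
    )


# Toy setup: three character classes with made-up 2-D centroids.
centroids = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
x = [2.0, 0.5]  # made-up character embedding

target = soft_distribution(centroids, x)  # soft label replacing a one-hot vector
prediction = [0.6, 0.1, 0.3]              # hypothetical model output
loss = kl_divergence(target, prediction)
```

Unlike a one-hot label, `target` spreads some probability onto classes whose centroids lie close to the embedding, which is the extra information the loss passes to the model in this sketch.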
Abstract
Incorporating linguistic knowledge can improve scene text recognition, but it is questionable whether the same holds for scene text spotting, which typically involves text detection and recognition. This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models. This allows the model to capture the relationship between characters in the same word. Additionally, we introduce a technique to generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning. As a result, the newly created text distributions are more informative than pure one-hot encoding, leading to improved spotting and recognition performance. Our method is simple and efficient, and it can easily be integrated into existing auto-regressive-based approaches. Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words. It significantly improves both state-of-the-art scene text spotting and recognition pipelines, achieving state-of-the-art results on several benchmarks.
