Efficiently Leveraging Linguistic Priors for Scene Text Spotting

Nguyen Nguyen; Yapeng Tian; Chenliang Xu

TL;DR

This work tackles the underutilization of linguistic priors in scene text spotting by replacing one-hot labels with soft distributions derived from a pretrained language model, enabling word-level character relations to guide detection and recognition without additional training or inference cost. It introduces centroid generation from CANINE-based character embeddings and soft distribution generation, where per-character distributions $D_i$ are computed as $D_i = \frac{\exp(W_i^T x_j)}{\sum_k \exp(W_k^T x_j)}$ and optimized against language-informed targets with KL divergence $L = \sum_i D^{(i)} \log(D^{(i)}/P^{(i)})$ under a noise threshold $T = 0.85$. The method demonstrates consistent, significant improvements in scene text spotting across Total-Text, ICDAR15, and SCUT-CTW1500, and achieves state-of-the-art results on several benchmarks, even when using dictionaries or external data (ED). It also boosts scene text recognition by leveraging TargetDict and cross-domain linguistic priors, achieving competitive or superior performance without increasing model complexity, and is validated on backbones such as ABCNetv2, Mask TextSpotterv3, and CornerTransformer.
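The two formulas above can be illustrated with a small sketch. It assumes that $W_k$ are per-character centroids, that $x_j$ is the embedding of the ground-truth character, and that $P$ is the recognizer's predicted distribution at the same decoding step. The function names, the toy 2-D vectors, and the three-character alphabet are hypothetical. The threshold $T$ is left out because this summary does not say how it is applied.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target(centroids, x):
    """D_i = exp(W_i^T x) / sum_k exp(W_k^T x) for one character embedding x."""
    scores = [sum(w_d * x_d for w_d, x_d in zip(w, x)) for w in centroids]
    return softmax(scores)

def kl_divergence(d, p, eps=1e-12):
    """KL(D || P) = sum_i D_i * log(D_i / P_i); eps guards against log(0)."""
    return sum(
        di * math.log((di + eps) / (pi + eps))
        for di, pi in zip(d, p)
        if di > 0
    )

# Toy values only: a 3-character alphabet with 2-D centroids
# (assumption: real centroids would come from CANINE embeddings).
centroids = [[1.0, 0.0], [0.8, 0.6], [-1.0, 0.0]]
x = [1.0, 0.1]  # toy embedding of the ground-truth character

target = soft_target(centroids, x)      # soft label used in place of a one-hot vector
prediction = softmax([2.0, 0.5, -1.0])  # toy recognizer logits for the same step
loss = kl_divergence(target, prediction)
```

Unlike a one-hot label, `target` assigns non-zero mass to characters whose centroids lie close to the ground-truth embedding, which is how the soft label carries information about character relations into the loss.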

Abstract

Incorporating linguistic knowledge can improve scene text recognition, but it is questionable whether the same holds for scene text spotting, which typically involves text detection and recognition. This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models. This allows the model to capture the relationship between characters in the same word. Additionally, we introduce a technique to generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning. As a result, the newly created text distributions are more informative than pure one-hot encoding, leading to improved spotting and recognition performance. Our method is simple and efficient, and it can easily be integrated into existing auto-regressive-based approaches. Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words. It significantly improves both state-of-the-art scene text spotting and recognition pipelines, achieving state-of-the-art results on several benchmarks.

Paper Structure (16 sections, 6 equations, 4 figures, 5 tables)

Introduction
Related Works
Language-guided Scene Text Spotting
Autoregressive-based Scene Text Recognition
Character Embedding
Centroid Generation
Soft Distribution Generation
Implementation Details
Experiments
Scene Text Spotting Experiments
Experiment on Total-Text
Experiment on ICDAR 15
Experiment on SCUT-CTW1500
Detection Results
Scene Text Recognition Experiments
...and 1 more sections

Figures (4)

Figure 1: Comparison of detection results with (green shaded) and without (red shaded) language prior guidance. The language prior helps not only text recognition but also text detection.
Figure 2: Traditional spotting pipeline (a) and proposed pipeline (b) during training. In the traditional pipeline, models use the one-hot label directly to guide training of the scene text system. Our proposal replaces the one-hot encoding with a soft distribution for every label character, improving both detection and recognition results. We also propose a method that leverages knowledge from pretrained language models to construct soft distributions well adapted to the scene text domain, without fine-tuning the language models.
Figure 3: Centroid Estimation. Visualization of character embeddings for 9 characters $a,b,c,d,e,f,g,h,i$. Each cluster corresponds to one character, and the black points at the cluster centers are the centroids generated by (\ref{eq:equ2}).
Figure 4: Qualitative results on the Total-Text dataset. Our approach recognizes scene text more reliably than the baseline. These outputs are taken directly from the model, with no dictionary used in the testing phase.
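
The clusters and centroids described in the Figure 3 caption can be sketched as follows. This is only an illustration: it assumes a centroid is the plain mean of the embeddings observed for a character, which may differ from the paper's centroid equation. The function name `estimate_centroids` and the toy 2-D embeddings are hypothetical.

```python
from collections import defaultdict

def estimate_centroids(samples):
    """Average the embedding vectors observed for each character.

    samples: iterable of (character, embedding) pairs, where each embedding
    is a list of floats of equal length.
    """
    sums = {}
    counts = defaultdict(int)
    for char, vec in samples:
        if char not in sums:
            sums[char] = [0.0] * len(vec)
        sums[char] = [s + v for s, v in zip(sums[char], vec)]
        counts[char] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

# Toy embeddings (assumption: real ones would be contextual character
# embeddings taken from a pretrained model such as CANINE).
samples = [
    ("a", [1.0, 0.0]),
    ("a", [0.8, 0.2]),
    ("b", [0.0, 1.0]),
    ("b", [0.2, 0.6]),
]
centroids = estimate_centroids(samples)
```

Each entry of `centroids` plays the role of one black point in the figure: a single representative vector per character.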
