Efficiently Leveraging Linguistic Priors for Scene Text Spotting

Nguyen Nguyen, Yapeng Tian, Chenliang Xu

TL;DR

This work tackles the underutilization of linguistic priors in scene text spotting by replacing one-hot labels with soft distributions derived from a pretrained language model, enabling word-level character relations to guide detection and recognition without additional training or inference cost. It introduces centroid generation from CANINE-based character embeddings and soft distribution generation, where per-character distributions $D_i$ are computed as $D_i = \frac{\exp(W_i^\top x_j)}{\sum_k \exp(W_k^\top x_j)}$ and optimized against language-informed targets with the KL divergence $L = \sum_i D^{(i)} \log\left(D^{(i)}/P^{(i)}\right)$ under a noise threshold $T = 0.85$. The method demonstrates consistent, significant improvements in scene text spotting across Total-Text, ICDAR15, and SCUT-CTW1500, and achieves state-of-the-art results on several benchmarks, even when using dictionaries or external data (ED). It also boosts scene text recognition by leveraging TargetDict and cross-domain linguistic priors, achieving competitive or superior performance without increasing model complexity, and is validated on backbones such as ABCNetv2, Mask TextSpotterv3, and CornerTransformer.
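
The two formulas in the summary above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy 2-D centroids, and the three-character alphabet are hypothetical stand-ins for CANINE-derived embeddings, and `P` is assumed to be the recognizer's predicted distribution. The noise threshold $T = 0.85$ is left out because the summary does not say how it is applied.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_distribution(centroids, x):
    """D_i = exp(W_i^T x) / sum_k exp(W_k^T x), where W_i is the centroid of character i."""
    scores = [sum(w * v for w, v in zip(W_i, x)) for W_i in centroids]
    return softmax(scores)

def kl_divergence(D, P, eps=1e-12):
    """KL(D || P) = sum_i D_i * log(D_i / P_i); terms with D_i = 0 contribute nothing."""
    return sum(d * math.log(d / max(p, eps)) for d, p in zip(D, P) if d > 0)

# Hypothetical toy data: one centroid per character of a 3-character alphabet.
centroids = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
x = [1.0, 0.2]  # hypothetical embedding of a single label character

D = soft_distribution(centroids, x)  # soft target that replaces the one-hot label
P = [1 / 3, 1 / 3, 1 / 3]            # assumed model prediction (uniform here)
loss = kl_divergence(D, P)
```

In this toy setup the soft target still peaks at the nearest centroid but assigns non-zero mass to similar characters, which is the extra information a one-hot label cannot carry.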

Abstract

Incorporating linguistic knowledge can improve scene text recognition, but it is questionable whether the same holds for scene text spotting, which typically involves text detection and recognition. This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models. This allows the model to capture the relationship between characters in the same word. Additionally, we introduce a technique to generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning. As a result, the newly created text distributions are more informative than pure one-hot encoding, leading to improved spotting and recognition performance. Our method is simple and efficient, and it can easily be integrated into existing auto-regressive-based approaches. Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words. It significantly improves both state-of-the-art scene text spotting and recognition pipelines, achieving state-of-the-art results on several benchmarks.

Paper Structure (16 sections, 6 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Comparison of detection results with (green shaded) and without (red shaded) language knowledge prior guidance. The language prior helps not only text recognition but also text detection.
  • Figure 2: Traditional spotting pipeline (a) and proposed pipeline (b) during training. In the traditional pipeline, models use the one-hot label directly to guide the training of the scene text system. Our proposal replaces the one-hot encoding with a soft distribution for every label character, improving detection and recognition results. We also propose a method that leverages knowledge from pretrained language models to construct soft distributions well adapted to the scene text domain, without fine-tuning the language models.
  • Figure 3: Centroid Estimation. Visualization of character embeddings for 9 characters $a,b,c,d,e,f,g,h,i$. Each cluster corresponds to one character, and the black points at the cluster centers are the centroids generated by Eq. (2).
  • Figure 4: Qualitative results on the Total-Text dataset. Our approach recognizes scene text more reliably than the baseline. These outputs are taken directly from the model, with no dictionary used in the testing phase.