Decoder Pre-Training with only Text for Scene Text Recognition

Shuai Zhao; Yongkun Du; Zhineng Chen; Yu-Gang Jiang

TL;DR

DPTR addresses the domain gap in scene text recognition by pre-training the decoder with text-derived CLIP embeddings rather than image-text paired data. It introduces Offline Random Perturbation to diversify representations by injecting cropped CLIP image features, and a Feature Merge Unit to emphasize foreground characters during fine-tuning. The method is model-agnostic and delivers gains across English, Chinese, and multilingual STR decoders, achieving competitive or state-of-the-art results. This work demonstrates the potential of large vision-language models to enhance OCR tasks without requiring large-scale labelled real-image text data.

Abstract

Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets. However, the domain gap between synthetic and real images makes it difficult to learn feature representations that align well with images of real scenes, which limits the performance of these methods. We note that vision-language models like CLIP, pre-trained on extensive real image-text pairs, effectively align images and text in a unified embedding space, suggesting the potential to derive the representations of real images from text alone. Building upon this premise, we introduce a novel method named Decoder Pre-training with only text for STR (DPTR). DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder. We also introduce an Offline Random Perturbation (ORP) strategy, which enriches the diversity of text embeddings by incorporating natural image embeddings extracted from the CLIP image encoder, effectively directing the decoder to acquire the potential representations of real images. In addition, we introduce a Feature Merge Unit (FMU) that guides the extracted visual embeddings to focus on the character foreground within the text image, thereby enabling the pre-trained decoder to work more efficiently and accurately. Extensive experiments across various STR decoders and language recognition tasks underscore the broad applicability and remarkable performance of DPTR, providing a novel insight for STR pre-training. Code is available at https://github.com/Topdu/OpenOCR
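
The perturbation idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `perturb_text_embedding`, the `image_bank` list, and the additive mixing scaled by `noise_ratio` are all assumptions made for illustration, and the embeddings are short stand-in vectors rather than real CLIP outputs.

```python
import random


def perturb_text_embedding(text_emb, image_bank, noise_ratio, rng):
    """Return text_emb plus a scaled image embedding drawn from image_bank.

    Assumption: the perturbation is additive and scaled by noise_ratio;
    the paper's exact mixing rule may differ.
    """
    noise = rng.choice(image_bank)  # image_bank is assumed to be precomputed offline
    if len(noise) != len(text_emb):
        raise ValueError("embedding dimensions must match")
    return [t + noise_ratio * n for t, n in zip(text_emb, noise)]


rng = random.Random(0)
text_emb = [1.0, 2.0]        # stand-in for a CLIP text embedding
image_bank = [[2.0, -4.0]]   # stand-in for cached CLIP image embeddings
mixed = perturb_text_embedding(text_emb, image_bank, 0.5, rng)
```

Caching the image embeddings once and sampling from them during pre-training is one plausible reading of "offline": under that reading, the CLIP image encoder would not need to run inside the training loop.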

Paper Structure (11 sections, 7 equations, 7 figures, 8 tables)

This paper contains 11 sections, 7 equations, 7 figures, 8 tables.

Introduction
Related Work
Method
Decoder Pre-training
Model Fine-tuning
Experiment
Datasets
Experimental Settings
Ablation Study
Comparisons with State-of-the-Arts
Conclusion

Figures (7)

Figure 1: CLIP similarity computed as the dot product between the text embedding's [EOS] token and the image embedding's [CLS] token. The text embeddings are more similar to the embeddings of real images than to those of synthetic images.
Figure 2: The pipeline of DPTR. We pre-train the decoder by encoding the prompt text following the template "a photo of a 'label'" using the CLIP text encoder. An Offline Random Perturbation (ORP) is incorporated to prevent model overfitting. Then the entire model undergoes fine-tuning using labelled text images. A Feature Merge Unit (FMU) is developed to guide the model's visual attention towards foreground characters. $\mathcal{L}_{ce}$ denotes the cross-entropy loss.
Figure 3: Two examples of decoder attention map comparison between $Synth$ (left) and $DPTR$ (right).
Figure 4: Comparison of CLIP text feature distribution with different noise ratios. Each is represented by a distinct color.
Figure 5: Character distribution visualization of the decoder pre-trained by $Synth$ and $DPTR$. Point color represents the character category. In (a), '+' and 'x' represent two incorrect predictions, e.g., '2' and '0', whereas in (b), they are correctly recognized.
...and 2 more figures
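
Two steps named in the captions above can be sketched in a few lines: building the text prompt from the Figure 2 template, and scoring a text embedding against an image embedding as in Figure 1. This is a minimal sketch under assumptions. The helper names `build_prompt` and `cosine_similarity` are hypothetical, the quoting of the label follows the caption literally, and the similarity is taken to be cosine similarity (the dot product of L2-normalized vectors), which is how CLIP similarity is usually computed. The vectors are stand-ins, not real CLIP outputs.

```python
import math


def build_prompt(label):
    """Fill the caption's template "a photo of a 'label'" with a text label."""
    return f"a photo of a '{label}'"


def cosine_similarity(a, b):
    """Dot product of a and b after L2 normalization."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions must match")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


prompt = build_prompt("coffee")
text_eos = [3.0, 4.0]    # stand-in for the text embedding's [EOS] token
image_cls = [6.0, 8.0]   # stand-in for the image embedding's [CLS] token
score = cosine_similarity(text_eos, image_cls)
```

In a real pipeline the two vectors would come from the CLIP text and image encoders; only the scoring and prompt construction are shown here.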
