Decoder Pre-Training with only Text for Scene Text Recognition

Shuai Zhao, Yongkun Du, Zhineng Chen, Yu-Gang Jiang

TL;DR

DPTR addresses the synthetic-to-real domain gap in scene text recognition (STR) by pre-training the decoder on text alone: CLIP text embeddings stand in for visual embeddings, so no image-text paired data is needed at this stage. It introduces Offline Randomized Perturbation (ORP), which diversifies the text embeddings by mixing in natural-image features from the CLIP image encoder, and a Feature Merge Unit (FMU), which steers the visual embeddings toward foreground characters during fine-tuning. The method applies to a range of STR decoders and delivers gains on English, Chinese, and multilingual recognition, achieving competitive or state-of-the-art results. The work shows how large vision-language models can strengthen OCR pre-training without large-scale labelled real text images.

Abstract

Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets. However, the domain gap between synthetic and real images makes it hard to acquire feature representations that align well with images in real scenes, which limits the performance of these methods. We note that vision-language models like CLIP, pre-trained on extensive real image-text pairs, effectively align images and text in a unified embedding space, suggesting that representations of real images can be derived from text alone. Building upon this premise, we introduce a novel method named Decoder Pre-training with only text for STR (DPTR). DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder. We also introduce an Offline Randomized Perturbation (ORP) strategy, which enriches the diversity of the text embeddings by incorporating natural image embeddings extracted from the CLIP image encoder, effectively directing the decoder to acquire the potential representations of real images. In addition, we introduce a Feature Merge Unit (FMU) that guides the extracted visual embeddings to focus on the character foreground within the text image, thereby enabling the pre-trained decoder to work more efficiently and accurately. Extensive experiments across various STR decoders and language recognition tasks underscore the broad applicability and remarkable performance of DPTR, providing a novel insight for STR pre-training. Code is available at https://github.com/Topdu/OpenOCR
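
The abstract describes ORP only at a high level: text embeddings are enriched with natural-image embeddings from the CLIP image encoder. The following is a minimal sketch of one way such mixing could look, not the paper's formulation. The additive blend, the `noise_ratio` parameter, the final L2 normalization, and the reading of "offline" as "drawn from a precomputed bank" are all assumptions made for illustration, and the random toy vectors stand in for real CLIP outputs.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def perturb_text_embedding(text_emb, image_bank, noise_ratio, rng):
    """Blend a text embedding with one randomly chosen image embedding.

    ASSUMPTION: additive mixing weighted by `noise_ratio`, followed by
    re-normalization. The source does not give the exact ORP formula.
    `image_bank` is a hypothetical list of precomputed image embeddings.
    """
    image_emb = rng.choice(image_bank)
    text_unit = l2_normalize(text_emb)
    image_unit = l2_normalize(image_emb)
    mixed = [t + noise_ratio * i for t, i in zip(text_unit, image_unit)]
    return l2_normalize(mixed)


# Toy data: random vectors in place of CLIP text and image embeddings.
rng = random.Random(0)
dim = 8
text_emb = [rng.gauss(0.0, 1.0) for _ in range(dim)]
image_bank = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(4)]

clean = perturb_text_embedding(text_emb, image_bank, noise_ratio=0.0, rng=rng)
noisy = perturb_text_embedding(text_emb, image_bank, noise_ratio=0.1, rng=rng)
```

With a small `noise_ratio` the perturbed vector stays close to the original text embedding while still varying from draw to draw, which is the kind of diversity the abstract attributes to ORP.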

Paper Structure

This paper contains 11 sections, 7 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: CLIP similarity computed as the dot product between the text embedding's [EOS] token and the image embedding's [CLS] token. The text embeddings are more similar to the embeddings of real images than to those of synthetic images.
  • Figure 2: The pipeline of DPTR. We pre-train the decoder on embeddings obtained by encoding prompt text, following the template "a photo of a 'label'", with the CLIP text encoder. Offline Randomized Perturbation (ORP) is incorporated to prevent overfitting. The entire model is then fine-tuned on labelled text images. A Feature Merge Unit (FMU) is developed to guide the model's visual attention towards foreground characters. $\mathcal{L}_{ce}$ denotes the cross-entropy loss.
  • Figure 3: Two examples of decoder attention map comparison between $Synth$ (left) and $DPTR$ (right).
  • Figure 4: Comparison of CLIP text feature distribution with different noise ratios. Each is represented by a distinct color.
  • Figure 5: Character distribution visualization of the decoder pre-trained by $Synth$ and $DPTR$. Point color represents the character category. In (a), '+' and 'x' represent two incorrect predictions, e.g., '2' and '0', whereas in (b), they are correctly recognized.
  • ...and 2 more figures
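
The captions for Figures 1 and 2 mention two small mechanics: the prompt template fed to the CLIP text encoder, and a similarity score between a text embedding and an image embedding. The sketch below illustrates both, assuming the similarity is the usual normalized dot product (cosine similarity). The helper names and the three-dimensional toy vectors are hypothetical; the numbers are made up purely to exercise the function and say nothing about real CLIP behaviour.

```python
import math


def build_prompt(label):
    """Fill the template quoted in the Figure 2 caption with a text label."""
    return f"a photo of a '{label}'"


def cosine_similarity(a, b):
    """Dot product of two vectors after scaling each to unit length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins (ASSUMPTION): in practice these would be the text
# encoder's [EOS] token and the image encoder's [CLS] token.
text_eos = [0.9, 0.1, 0.4]
real_cls = [0.8, 0.2, 0.5]
synth_cls = [0.1, 0.9, 0.3]

prompt = build_prompt("coffee")
sim_real = cosine_similarity(text_eos, real_cls)
sim_synth = cosine_similarity(text_eos, synth_cls)
```

Comparing `sim_real` with `sim_synth` over many text-image pairs is the kind of measurement Figure 1 summarizes.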