Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation

Andrea Maracani, Savas Ozkan, Sijun Cho, Hyowon Kim, Eunchung Noh, Jeongwon Min, Cho Jung Min, Dookun Park, Mete Ozay

TL;DR

This work analyzes how to scale Scene Text Recognition (STR) models and reveals that decoder scaling yields substantial gains across vision encoders, contrasting with prior emphasis on encoder scaling. It introduces Cloze Self-Distillation (CSD) to combat label noise by distilling context-rich, cloze-refined predictions from a teacher into a student, augmented with knowledge distillation terms, and proposes a Differential Cross-Attention-based Decoder to reduce attention noise. Together with a Permutation Language Decoder architecture, these components enable state-of-the-art STR performance on $10$ of $11$ benchmarks using real data, while significantly reducing parameters and FLOPs. The approach demonstrates robust improvements across data regimes (Real and RBU), offering practical gains in accuracy and efficiency for real-world STR applications.

Abstract

Scaling architectures has proven effective for improving Scene Text Recognition (STR), but the individual contributions of vision encoder and text decoder scaling remain under-explored. In this work, we present an in-depth empirical analysis and demonstrate that, contrary to previous observations, scaling the decoder yields significant performance gains, always exceeding those achieved by encoder scaling alone. We also identify label noise as a key challenge in STR, particularly in real-world data, which can limit the effectiveness of STR models. To address this, we propose Cloze Self-Distillation (CSD), a method that mitigates label noise by distilling a student model from context-aware soft predictions and pseudolabels generated by a teacher model. Additionally, we enhance the decoder architecture by introducing differential cross-attention for STR. Our methodology achieves state-of-the-art performance on 10 out of 11 benchmarks using only real data, while significantly reducing the parameter size and computational costs.
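
The distillation idea in the abstract can be illustrated with a small sketch. It assumes the teacher has already produced, for each character position, a pseudolabel index and a soft distribution obtained by cloze-filling (predicting that position from the other characters). The function name `csd_loss`, the weight `alpha`, and the KL direction are illustrative assumptions, not the paper's exact objective.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def csd_loss(student_logits, pseudolabel, teacher_probs, alpha=1.0):
    """Hypothetical per-word loss: NLL on teacher pseudolabels plus a KD term.

    student_logits: one list of logits per character position.
    pseudolabel:    one class index per position (teacher's hard prediction).
    teacher_probs:  one probability list per position (teacher's soft,
                    cloze-refined prediction).
    alpha:          assumed weight of the KD term.
    """
    nll, kd = 0.0, 0.0
    for logits, target, t in zip(student_logits, pseudolabel, teacher_probs):
        s = softmax(logits)
        # Negative log likelihood of the teacher's pseudolabel.
        nll -= math.log(s[target])
        # KL(teacher || student) as the knowledge distillation term.
        kd += sum(p * math.log(p / q) for p, q in zip(t, s) if p > 0)
    n = len(pseudolabel)
    return (nll + alpha * kd) / n

# Toy example: 2 character positions, 3-symbol charset.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
uniform_student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(csd_loss(uniform_student, labels, teacher))
```

A student whose distribution already matches the teacher's incurs no KD penalty, so only the NLL term remains; a student that disagrees is penalized by both terms.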

Paper Structure

This paper contains 22 sections, 12 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: Average word accuracy (%) on $11$ STR benchmarks for models with ViT-T, ViT-S and ViT-B vision encoders and $4$ different decoder sizes (see Sec. \ref{subsec:scaling_analysis}). Results are compared with the previous state-of-the-art model, CLIP4STR \cite{zhao2023clip4str}. Results using the Real training dataset (3.3M images) are depicted with solid lines and circle markers, while results using the RBU training dataset (6.5M images) are shown with dashed lines and diamond markers. The x-axis represents the total number of model parameters (in millions) on a logarithmic scale.
  • Figure 2: Examples of label inconsistencies and errors in the training set. For each image, we show the ground truth label (L) and the teacher-generated pseudolabel (P). Subfigures (a-c) illustrate typical label errors, such as spelling mistakes or missing characters. Subfigures (d,e) highlight label inconsistencies, where punctuation or occluded parts are not annotated. Subfigure (f) demonstrates a labelling error caused by severe degradation in the image quality.
  • Figure 3: The overall architecture of our STR model. Our model mainly consists of a Vision Encoder $E$ and a Text Decoder $D$. Details are given in Sec. \ref{sec:method}.
  • Figure 4: Flow of Cloze Self-Distillation (CSD). Pseudolabels and soft predictions of a fixed teacher model, obtained with the cloze-filling approach, are distilled into a student model by minimizing the negative log likelihood (NLL) and the knowledge distillation (KD) objective, presented in Eq. \ref{eq:CSD_objective}.
  • Figure 5: Differential Cross-Attention used in our PLD decoder. For simplicity, the diagram shows a single head.
  • ...and 3 more figures
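
As a rough illustration of the single-head differential cross-attention mentioned for Figure 5, the sketch below follows the general differential attention formulation, in which two softmax attention maps are computed and one is subtracted from the other to cancel common-mode attention noise. The function name, the weight shapes, and the fixed scalar `lam` are assumptions made for this example; the parameterization actually used in the paper's PLD decoder is not given in this section.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_cross_attention(queries, memory, Wq, Wk, Wv, lam=0.5):
    """Single-head differential cross-attention (illustrative sketch).

    queries: (T, d_model) decoder token states.
    memory:  (N, d_model) vision encoder tokens.
    Wq, Wk:  (d_model, 2 * d_head) projections, split into two halves.
    Wv:      (d_model, d_v) value projection.
    lam:     assumed fixed scalar weighting the subtracted attention map.
    """
    q1, q2 = np.split(queries @ Wq, 2, axis=-1)
    k1, k2 = np.split(memory @ Wk, 2, axis=-1)
    v = memory @ Wv
    scale = 1.0 / np.sqrt(q1.shape[-1])
    a1 = softmax(q1 @ k1.T * scale)  # first attention map, shape (T, N)
    a2 = softmax(q2 @ k2.T * scale)  # second attention map, shape (T, N)
    # Subtracting the second map suppresses attention shared by both.
    return (a1 - lam * a2) @ v

rng = np.random.default_rng(0)
d_model, d_head, d_v = 16, 4, 8
out = differential_cross_attention(
    rng.normal(size=(4, d_model)),    # 4 decoder positions
    rng.normal(size=(10, d_model)),   # 10 visual tokens
    rng.normal(size=(d_model, 2 * d_head)),
    rng.normal(size=(d_model, 2 * d_head)),
    rng.normal(size=(d_model, d_v)),
)
print(out.shape)
```

With `lam=0` this reduces to ordinary softmax cross-attention over the first query/key halves, which makes the subtraction easy to ablate in experiments.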