Self-distillation Regularized Connectionist Temporal Classification Loss for Text Recognition: A Simple Yet Effective Approach

Ziyin Zhang, Ning Lu, Minghui Liao, Yongshuai Huang, Cheng Li, Min Wang, Wei Peng

TL;DR

The paper presents Distillation CTC (DCTC), a module-free self-distillation loss for CTC-based text recognition that adds frame-wise supervision through a MAP-derived latent alignment $z^*$. By combining the standard CTC loss with a distillation term, and deriving a closed-form estimate for $z^*$ from the CTC gradient $\mathbf{G}$ and probabilities $\mathbf{P}$, DCTC addresses alignment inconsistency without extra parameters or training phases. Empirical results across English and Chinese benchmarks show accuracy gains of up to $2.6\%$ while preserving inference speed, and analyses demonstrate improved latent alignment quality and more cohesive feature representations. The method offers a lightweight, practical improvement with strong model- and loss-wise performance gains in text recognition tasks.

Abstract

Text recognition methods are developing rapidly. Advanced techniques such as powerful modules, language models, and un- and semi-supervised learning schemes continue to push performance on public benchmarks forward. However, the problem of how to better optimize a text recognition model from the perspective of loss functions is largely overlooked. CTC-based methods, widely used in practice due to their good balance between performance and inference speed, still grapple with accuracy degradation. This is because CTC loss emphasizes the optimization of the entire sequence target while neglecting to learn individual characters. We propose a self-distillation scheme for CTC-based models to address this issue. It incorporates a framewise regularization term into the CTC loss to emphasize individual supervision, and leverages maximum a posteriori (MAP) estimation of the latent alignment to solve the inconsistency problem that arises in distillation between CTC-based models. We refer to the regularized CTC loss as Distillation Connectionist Temporal Classification (DCTC) loss. DCTC loss is module-free: it requires no extra parameters, no added inference latency, and no additional training data or phases. Extensive experiments on public benchmarks demonstrate that DCTC can boost text recognition model accuracy by up to 2.6% without any of these drawbacks.
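
The idea of adding a framewise term to the CTC loss can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. This page does not state the closed form for $z^*$, so the sketch assumes $z^*$ is the per-frame argmax of the CTC posterior occupancy computed by forward-backward. It also assumes the distillation term is a cross-entropy against $z^*$. The names `ctc_posteriors`, `dctc_loss_sketch`, and the weight `lam` are hypothetical.

```python
import math

def ctc_posteriors(probs, target, blank=0):
    """Forward-backward over one sequence.

    probs[t][k] is the softmax probability of class k at frame t.
    Returns (nll, gamma), where gamma[t][k] is the posterior probability
    that frame t emits class k given the target sequence.
    """
    T, K = len(probs), len(probs[0])
    ext = [blank]                      # target interleaved with blanks
    for c in target:
        ext += [c, blank]
    S = len(ext)

    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][ext[0]]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s >= 1:
                a += alpha[t - 1][s - 1]
            # Skipping a blank is allowed only between distinct labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]

    # beta[t][s]: probability of completing the target after frame t,
    # given state s at frame t (emission at t is not included).
    beta = [[0.0] * S for _ in range(T)]
    beta[T - 1][S - 1] = 1.0
    if S > 1:
        beta[T - 1][S - 2] = 1.0
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1][s] * probs[t + 1][ext[s]]
            if s + 1 < S:
                b += beta[t + 1][s + 1] * probs[t + 1][ext[s + 1]]
            if s + 2 < S and ext[s + 2] != blank and ext[s + 2] != ext[s]:
                b += beta[t + 1][s + 2] * probs[t + 1][ext[s + 2]]
            beta[t][s] = b

    p_target = alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
    gamma = [[0.0] * K for _ in range(T)]
    for t in range(T):
        for s in range(S):
            gamma[t][ext[s]] += alpha[t][s] * beta[t][s] / p_target
    return -math.log(p_target), gamma

def dctc_loss_sketch(probs, target, lam=0.1, blank=0):
    """CTC loss plus a framewise term (assumed form, see lead-in)."""
    nll, gamma = ctc_posteriors(probs, target, blank)
    # Assumption: z* is the most probable class per frame under gamma.
    z_star = [max(range(len(row)), key=row.__getitem__) for row in gamma]
    frame_ce = -sum(math.log(probs[t][z]) for t, z in enumerate(z_star)) / len(probs)
    return nll + lam * frame_ce, z_star

# Toy example: 3 frames, classes {0: blank, 1: 'a', 2: 'b'}, target "ab".
probs = [[0.1, 0.8, 0.1], [0.6, 0.2, 0.2], [0.1, 0.1, 0.8]]
loss, z_star = dctc_loss_sketch(probs, [1, 2])
print(loss, z_star)
```

In this toy case the estimated alignment is `[1, 0, 2]` ('a', blank, 'b'), so the added term supervises each frame individually while the CTC term still scores the whole sequence.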

Paper Structure

This paper contains 17 sections, 10 equations, 4 figures, 3 tables, and 1 algorithm.

Figures (4)

  • Figure 1: An illustration of optimization and distillation on CTC- and attention-based models, also showing the alignment inconsistency problem
  • Figure 2: The architecture of DCTC in the self-distillation scheme
  • Figure 3: Curves of AACC of Estimated Latent Alignment
  • Figure 4: Feature visualization. Each row represents a hard sample cluster