CR-CTC: Consistency regularization on CTC for improved speech recognition

Zengwei Yao, Wei Kang, Xiaoyu Yang, Fangjun Kuang, Liyong Guo, Han Zhu, Zengrui Jin, Zhaoqing Li, Long Lin, Daniel Povey

TL;DR

This paper addresses the performance gap between efficient CTC-based ASR models and the stronger transducer and CTC/AED systems by introducing Consistency-Regularized CTC (CR-CTC). CR-CTC passes two independently augmented views of the same speech input through a shared encoder in a two-branch architecture and enforces agreement between the two resulting CTC distributions. It optimizes a combined loss with two CTC terms and a cross-branch consistency term: $L = \tfrac{1}{2}(L_{CTC}(z^{(a)}, y) + L_{CTC}(z^{(b)}, y)) + \alpha L_{CR}(z^{(a)}, z^{(b)})$, where $z^{(a)}$ and $z^{(b)}$ are the CTC distributions of the two branches, $y$ is the target transcript, and $\alpha$ weights the consistency term. The approach also relies on masked prediction within time-masked regions to learn contextual representations, with the amount of time masking increased by a factor of 2.5, and on suppressing peaky CTC distributions to reduce overfitting. Experiments on LibriSpeech, Aishell-1, and GigaSpeech show substantial improvements over vanilla CTC and performance competitive with transducer and CTC/AED baselines, including gains in joint training settings. The result is a practical, self-contained method that narrows the performance gap for CTC-based ASR, and it comes with an open-source implementation.
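
The combined loss can be illustrated with a minimal pure-Python sketch. It assumes the consistency term $L_{CR}$ is a symmetric per-frame KL divergence between the two branch distributions, which this summary does not spell out, and it takes the two CTC loss values as precomputed numbers rather than computing them. The function names and the `alpha` value in the usage lines are illustrative choices, not taken from the released code.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def consistency_loss(z_a, z_b):
    """Assumed form of L_CR: symmetric KL, summed over frames.

    z_a and z_b are lists of per-frame token distributions, one per branch.
    In a real training setup the target side of each KL term would typically
    be detached from the gradient; that has no effect on this forward-only
    sketch.
    """
    total = 0.0
    for p_a, p_b in zip(z_a, z_b):
        total += 0.5 * (kl_divergence(p_b, p_a) + kl_divergence(p_a, p_b))
    return total


def cr_ctc_loss(ctc_a, ctc_b, z_a, z_b, alpha):
    """L = 1/2 (L_CTC(z_a, y) + L_CTC(z_b, y)) + alpha * L_CR(z_a, z_b).

    ctc_a and ctc_b are the already-computed CTC losses of the two branches.
    """
    return 0.5 * (ctc_a + ctc_b) + alpha * consistency_loss(z_a, z_b)


# Two frames, three tokens; the branches disagree slightly on each frame.
z_a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
z_b = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
loss = cr_ctc_loss(ctc_a=2.0, ctc_b=4.0, z_a=z_a, z_b=z_b, alpha=0.2)
```

When both branches produce identical distributions the consistency term vanishes and the loss reduces to the average of the two CTC losses; any disagreement adds a positive penalty scaled by `alpha`.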

Abstract

Connectionist Temporal Classification (CTC) is a widely used method for automatic speech recognition (ASR), renowned for its simplicity and computational efficiency. However, it often falls short in recognition performance. In this work, we propose the Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram. We provide in-depth insights into its essential behaviors from three perspectives: 1) it conducts self-distillation between random pairs of sub-models that process different augmented views; 2) it learns contextual representation through masked prediction for positions within time-masked regions, especially when we increase the amount of time masking; 3) it suppresses the extremely peaky CTC distributions, thereby reducing overfitting and improving the generalization ability. Extensive experiments on LibriSpeech, Aishell-1, and GigaSpeech datasets demonstrate the effectiveness of our CR-CTC. It significantly improves the CTC performance, achieving state-of-the-art results comparable to those attained by transducer or systems combining CTC and attention-based encoder-decoder (CTC/AED). We release our code at https://github.com/k2-fsa/icefall.
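
The "different augmented views" with increased time masking can be sketched as follows. This is a simplified illustration, not the augmentation pipeline of the released code: it applies time masking only, the mask counts and widths are made-up values, and applying the 2.5 factor to the number of masked regions is an assumption about how the increase is realized.

```python
import random


def time_mask(frames, num_masks, max_width, rng, mask_value=0.0):
    """Return a copy of `frames` with up to `num_masks` random time spans masked.

    `frames` is a list of feature vectors (one list of floats per time step).
    """
    masked = [list(f) for f in frames]
    num_frames = len(masked)
    for _ in range(num_masks):
        width = rng.randint(0, min(max_width, num_frames))
        start = rng.randint(0, num_frames - width)
        for t in range(start, start + width):
            masked[t] = [mask_value] * len(masked[t])
    return masked


def two_views(frames, base_num_masks=2, max_width=3, scale=2.5, seed=0):
    """Draw two independently masked views of the same input.

    `scale` multiplies the number of masked regions (hypothetical choice of
    where the 2.5 factor applies); both views share one random generator, so
    their masks are sampled independently of each other.
    """
    rng = random.Random(seed)
    num_masks = round(base_num_masks * scale)
    view_a = time_mask(frames, num_masks, max_width, rng)
    view_b = time_mask(frames, num_masks, max_width, rng)
    return view_a, view_b


# 20 frames of 2-dimensional features with non-zero values.
frames = [[float(t + 1), float(t + 1)] for t in range(20)]
view_a, view_b = two_views(frames)
```

Because the masks are drawn independently, a frame hidden in one view is usually visible in the other, so the consistency term pushes the masked branch to predict from context what the other branch can observe directly.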

Paper Structure

This paper contains 19 sections, 8 equations, 2 figures, 17 tables.

Figures (2)

  • Figure 1: Overall architecture of CR-CTC.
  • Figure 2: Visualization of token emitting probabilities for vanilla CTC (left) and our CR-CTC (right) on four randomly selected samples from the LibriSpeech test set. The gray dashed lines indicate the blank token. Compared to vanilla CTC, the token distributions in CR-CTC are smoother, with lower emitting probabilities and more repeating non-blank tokens.