Delayed-KD: Delayed Knowledge Distillation based CTC for Low-Latency Streaming ASR

Longhao Li, Yangze Li, Hongfei Xue, Jie Liu, Shuai Fang, Kai Wang, Lei Xie

TL;DR

Delayed-KD addresses the latency-accuracy trade-off in CTC-based streaming ASR by distilling CTC posteriors from a non-streaming teacher into a streaming student, using a Temporal Alignment Buffer (TAB) that allows a bounded delay between teacher and student outputs. The approach jointly optimizes ASR and distillation losses and uses a two-pass rescoring decoder to combine frame-level CTC cues with attention-based encoder-decoder (AED) scoring. Results on AISHELL-1 and WenetSpeech show that a model running at 40 ms latency with TAB can match or surpass higher-latency baselines, and that TAB gives controllable emission delay with gains in both streaming and rescoring modes. Overall, Delayed-KD is a practical low-latency streaming ASR approach whose gains carry over to large-scale Mandarin data.

Abstract

CTC-based streaming ASR has gained significant attention in real-world applications but faces two main challenges: accuracy degradation with small chunk sizes and token emission latency. To mitigate these challenges, we propose Delayed-KD, which applies delayed knowledge distillation on CTC posterior probabilities from a non-streaming model to a streaming model. Specifically, with a tiny chunk size, we introduce a Temporal Alignment Buffer (TAB) that defines a delay range relative to the non-streaming teacher model in order to align CTC outputs and mitigate non-blank token mismatches. Additionally, TAB enables fine-grained control over token emission delay. Experiments on the 178-hour AISHELL-1 and 10,000-hour WenetSpeech Mandarin datasets show consistent superiority of Delayed-KD. Notably, Delayed-KD at 40 ms latency achieves a character error rate (CER) of 5.42% on AISHELL-1, comparable to the competitive U2++ model running at 320 ms latency.
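
The idea of aligning student and teacher CTC outputs within a delay range can be illustrated with a toy sketch. The code below is not the paper's formulation, which this summary does not give. It assumes, purely for illustration, that student frame t is compared with teacher frame t - d for each candidate delay d from 0 to a maximum, that the per-frame mismatch is a KL divergence, and that the delay with the smallest mean mismatch is kept. The names `kl_divergence`, `delayed_kd_loss`, and `max_delay` are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def delayed_kd_loss(teacher, student, max_delay):
    """Return (loss, delay): the smallest mean frame-level KL over delays 0..max_delay.

    Assumption for this sketch: student frame t is matched to teacher frame
    t - d, i.e. the student may emit up to max_delay frames after the teacher.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of frames")
    num_frames = len(student)
    best_loss, best_delay = math.inf, 0
    for d in range(min(max_delay, num_frames - 1) + 1):
        # Only frames where the shifted teacher index t - d is valid are compared.
        frame_losses = [
            kl_divergence(teacher[t - d], student[t]) for t in range(d, num_frames)
        ]
        loss = sum(frame_losses) / len(frame_losses)
        if loss < best_loss:
            best_loss, best_delay = loss, d
    return best_loss, best_delay


# Toy posteriors over a two-symbol vocabulary: [blank, token].
blank, spike = [0.9, 0.1], [0.1, 0.9]
teacher = [blank, spike, blank, blank, blank]   # teacher spikes at frame 1
student = [blank, blank, blank, spike, blank]   # student spikes at frame 3

print(delayed_kd_loss(teacher, student, max_delay=0))  # no buffer: spikes mismatch
print(delayed_kd_loss(teacher, student, max_delay=3))  # buffer covers the 2-frame lag
```

With no buffer, the delayed student spike is penalized as a mismatch even though the token is correct. With a buffer of three frames, the two-frame lag is absorbed and the mismatch vanishes. In this sketch, shrinking `max_delay` is what would push a student toward earlier emission, which is one way to read the abstract's claim about controlling emission delay.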

Paper Structure

This paper contains 14 sections, 5 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Model architecture of our proposed Delayed-KD
  • Figure 2: Comparison of CTC spike distributions between U2++ and Delayed-KD. Colored lines represent CTC spikes, and dashed lines align time axis positions.