CTC Blank Triggered Dynamic Layer-Skipping for Efficient CTC-based Speech Recognition

Junfeng Hou, Peiyao Wang, Jincheng Zhang, Meng Yang, Minwei Feng, Jingcheng Yin

TL;DR

This work tackles the efficiency challenge of end-to-end CTC-based speech recognition by introducing a dynamic encoder that skips the last few layers for frames with high CTC blank probability, guided by blanks emitted by intermediate layers. It combines a gating mechanism based on the intermediate-layer blank probability $p^{in}_t(\phi)$ with spike-extension, improves robustness through multi-task CTC training with knowledge distillation, and uses a factorized CTC formulation to accelerate decoding. The proposed approach achieves substantial runtime speedups (RTF reductions of up to 23%–38%, depending on decoding mode) while maintaining competitive WER, and factorized CTC contributes a further 6% speedup with minor accuracy trade-offs. The result is a practical route to adaptive inference for resource-constrained ASR deployments, including real-time applications and large-scale systems.

Abstract

Deploying end-to-end speech recognition models with limited computing resources remains challenging, despite their impressive performance. Given the gradual increase in model size and the wide range of model applications, selectively executing model components for different inputs to improve the inference efficiency is of great interest. In this paper, we propose a dynamic layer-skipping method that leverages the CTC blank output from intermediate layers to trigger the skipping of the last few encoder layers for frames with high blank probabilities. Furthermore, we factorize the CTC output distribution and perform knowledge distillation on intermediate layers to reduce computation and improve recognition accuracy. Experimental results show that by utilizing the CTC blank, the encoder layer depth can be adjusted dynamically, resulting in 29% acceleration of the CTC model inference with minor performance degradation.
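The skipping rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `skip_mask` and `run_upper_layers`, the threshold of 0.9, the symmetric one-frame spike-extension, and the choice to leave skipped frames with their lower-encoder representation are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Frame = List[float]
Layer = Callable[[List[Frame]], List[Frame]]


def skip_mask(blank_probs: Sequence[float],
              threshold: float = 0.9,
              extend: int = 1) -> List[bool]:
    """Return True for frames that may skip the upper encoder layers.

    blank_probs: per-frame blank probability from an intermediate CTC head.
    threshold:   hypothetical cut-off above which a frame counts as blank.
    extend:      hypothetical spike-extension width (frames on each side).
    """
    n = len(blank_probs)
    # Frames below the threshold are treated as non-blank "spikes".
    spikes = [p < threshold for p in blank_probs]
    # Spike-extension (assumed form): also keep the neighbours of each spike.
    keep = list(spikes)
    for t, is_spike in enumerate(spikes):
        if is_spike:
            for u in range(max(0, t - extend), min(n, t + extend + 1)):
                keep[u] = True
    return [not k for k in keep]


def run_upper_layers(frames: List[Frame],
                     skip: Sequence[bool],
                     upper_layers: Sequence[Layer]) -> List[Frame]:
    """Apply the upper layers only to the frames that are not skipped."""
    kept_idx = [t for t, s in enumerate(skip) if not s]
    kept = [frames[t] for t in kept_idx]
    for layer in upper_layers:
        kept = layer(kept)
    # Assumption: skipped frames keep their lower-encoder representation.
    out = list(frames)
    for t, h in zip(kept_idx, kept):
        out[t] = h
    return out
```

With `extend=0` only the spike frames themselves reach the upper layers; a larger value trades some of the speedup for more acoustic context around each non-blank frame.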

Paper Structure

This paper contains 9 sections, 4 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Architecture of the CTC-blank-triggered dynamic layer-skipping model. The encoder consists of two parts. The first part is always active, while the second part is executed only when the CTC blank output of the first part does not trigger skipping.
  • Figure 2: The CTC output of different layer-skipping models. (Left): Model without KL loss; (Right): Model with KL loss.
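The KL loss mentioned in the Figure 2 caption can be sketched as a per-frame KL divergence between two CTC posteriors. This is an illustration under stated assumptions rather than the paper's exact loss: it assumes the final-layer posterior acts as the teacher and the intermediate-layer posterior as the student, and the names `kl_divergence` and `distillation_loss` are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(teacher: Sequence[float],
                  student: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(teacher || student) for one frame's output distribution."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher, student)
        if p > 0.0
    )


def distillation_loss(teacher_frames: Sequence[Sequence[float]],
                      student_frames: Sequence[Sequence[float]]) -> float:
    """Average the per-frame KL over an utterance.

    Assumption: teacher_frames come from the final encoder layer and
    student_frames from an intermediate layer, both after softmax.
    """
    if len(teacher_frames) != len(student_frames):
        raise ValueError("teacher and student must have the same number of frames")
    if not teacher_frames:
        return 0.0
    total = sum(kl_divergence(p, q) for p, q in zip(teacher_frames, student_frames))
    return total / len(teacher_frames)
```

A term of this kind pulls the intermediate blank predictions toward those of the full-depth model, which matters here because the intermediate blanks decide which frames are skipped.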