Guiding Frame-Level CTC Alignments Using Self-knowledge Distillation

Eungbeom Kim, Hantae Kim, Kyogu Lee

TL;DR

The paper addresses teacher-student disagreement in frame-level alignments when knowledge distillation (KD) is applied to CTC-based automatic speech recognition (ASR). It introduces frame-level self-knowledge distillation (SKD), which embeds the teacher and the student within a single shared encoder. SKD uses a frame-level loss with stop-gradient on the teacher and a scheduling mechanism that gradually shifts emphasis from teacher to student, avoiding the alignment disagreement inherent in traditional teacher-student KD. Empirical results show that SKD often yields a lower word error rate (WER) than baselines such as Guide-CTC and Sfmx-KD, and that masking blank frames is not beneficial in SKD. Sharing layers also improves resource efficiency by reducing memory and computation, with gains demonstrated on HuBERT and WavLM backbones.

Abstract

The Transformer encoder with a connectionist temporal classification (CTC) framework is widely used for automatic speech recognition (ASR). However, knowledge distillation (KD) for ASR suffers from disagreement between the teacher and student models in frame-level alignment, which ultimately keeps it from improving the student model's performance. To resolve this problem, this paper introduces a self-knowledge distillation (SKD) method that guides the frame-level alignment during training. In contrast to the conventional method, which uses separate teacher and student models, this study introduces a simple and effective method that shares encoder layers and uses a sub-model as the student model. Overall, our approach is effective in improving both resource efficiency and performance. We also conducted an experimental analysis of the spike timings to illustrate that the proposed method improves performance by reducing the alignment disagreement.
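The frame-level distillation idea summarized above can be illustrated with a small sketch. It is not the authors' implementation: the KL-divergence form of the frame-level loss, the linear schedule in `student_weight`, and the way `total_loss` combines the terms are all assumptions made for illustration, and the CTC losses are passed in as precomputed numbers rather than computed here. The sketch applies the loss to every frame, including blank frames, since the summary reports that masking blanks does not help in SKD.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def frame_level_kd(teacher_logits, student_logits):
    """Mean over frames of KL(teacher || student) between per-frame posteriors.

    Both arguments are lists of frames, each frame a list of logits over the
    same vocabulary (blank included). In an autograd framework the teacher
    posteriors would be detached (stop-gradient); plain Python has no
    gradients, so the teacher is simply treated as a constant here.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must have the same number of frames")
    total = 0.0
    for t_frame, s_frame in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_frame)  # teacher (full L-layer encoder output)
        log_q = log_softmax(s_frame)  # student (first l layers of the same encoder)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_logits)


def student_weight(step, total_steps):
    """Hypothetical linear schedule: 0 at the start of training, 1 at the end."""
    return min(1.0, max(0.0, step / total_steps))


def total_loss(teacher_ctc, student_ctc, kd, step, total_steps):
    """Assumed combination: emphasis moves from the teacher term to the student term."""
    w = student_weight(step, total_steps)
    return (1.0 - w) * teacher_ctc + w * (student_ctc + kd)
```

When the student's frame-level posteriors match the teacher's, `frame_level_kd` returns zero, and it grows as the two disagree on which frames carry each label. Because teacher and student share the lower encoder layers, both sets of logits come from a single forward pass, which is where the resource savings described in the summary come from.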

Paper Structure

This paper contains 14 sections, 7 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Illustrations of knowledge distillation for ASR, comparing frame-level teacher-student alignments when the teacher model outputs the word "CAT". Red arrows denote undesired distillation and blue arrows denote desired distillation. (a) and (b) include undesired distillation and miss desired distillation, while (c) includes only desired distillation.
  • Figure 2: A framework for self-knowledge distillation of an $L$-layer CTC-based ASR model with an $l$-layer student model. The total training loss contains a self-teacher loss and a self-student loss.