Leave No Knowledge Behind During Knowledge Distillation: Towards Practical and Effective Knowledge Distillation for Code-Switching ASR Using Realistic Data

Liang-Hsuan Tseng, Zih-Ching Chen, Wei-Shun Chang, Cheng-Kuang Lee, Tsung-Ren Huang, Hung-yi Lee

TL;DR

The paper tackles the practicality gap in code-switching ASR (CS-ASR) by introducing K$^2$D, a three-stage knowledge distillation (KD) framework that leverages realistic unlabeled data. It combines realistic pseudo-labeling from a large teacher, a lightweight auxiliary validator for data filtering, and distillation to a compact student, with a loss that blends cross-entropy and KL-divergence terms. Empirical results on in-domain and out-of-domain datasets show that the student surpasses the teacher while being roughly 2× smaller and 5× faster, and that the composite distance metric yields the best balance between accuracy and hallucination control. This approach demonstrates that effective CS-ASR KD is achievable with unlabeled realistic data and a simple validation mechanism, offering practical benefits for deployment and future research in resource-constrained settings.

Abstract

Recent advances in automatic speech recognition (ASR) often rely on large speech foundation models to generate high-quality transcriptions. However, these models can be impractical when computing resources are limited. The problem is even more severe in realistic or difficult scenarios, such as code-switching ASR (CS-ASR). To address this, we present a framework for developing more efficient CS-ASR models through knowledge distillation using realistic speech-only data. Our proposed method, Leave No Knowledge Behind During Knowledge Distillation (K$^2$D), leverages both the teacher model's knowledge and additional insights from a small auxiliary model. We evaluate our approach on two in-domain and two out-of-domain datasets, demonstrating that K$^2$D is effective. By conducting K$^2$D on unlabeled realistic data, we obtain a model that is two times smaller and generates five times faster, while outperforming the baseline methods and the teacher model on all test sets. We have made our model publicly available on Hugging Face (https://huggingface.co/andybi7676/k2d-whisper.zh-en).

Paper Structure

This paper contains 21 sections, 10 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Our proposed framework K$^2$D achieves significant performance improvement over the teacher model (Whisper Large-v2) on both in-domain and OOD testing sets.
  • Figure 2: Overview of the K$^2$D Framework. (a) Realistic Pseudo-Labeling: The teacher model generates transcriptions with timestamps from long-form audio. (b) Data Pre-Filtering: Chunked audio is validated by the small auxiliary model, filtering out inaccurate labels. (c) Knowledge Distillation: Validated pseudo-labels are used to train the student model, enhancing accuracy and efficiency.
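
Stages (b) and (c) of Figure 2, together with the blended loss mentioned in the TL;DR, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the function names, the use of a normalized character-level edit distance as the agreement measure between teacher and auxiliary transcripts, the `max_error_rate` threshold, and the `alpha` weight between the cross-entropy and KL terms. The paper's actual distance metric and loss weighting are not given in this section.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two sequences (characters here)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def keep_chunk(teacher_text, aux_text, max_error_rate=0.3):
    """Hypothetical pre-filter: keep a chunk only if the auxiliary model's
    transcript is close enough to the teacher's pseudo-label."""
    rate = edit_distance(teacher_text, aux_text) / max(len(teacher_text), 1)
    return rate <= max_error_rate


def softmax(logits):
    """Numerically stabilized softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, pseudo_label, alpha=0.5):
    """Per-token loss: alpha * CE(pseudo-label) + (1 - alpha) * KL(teacher || student).
    The weighting scheme is an assumption, not taken from the paper."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    ce = -math.log(p_student[pseudo_label])
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    return alpha * ce + (1.0 - alpha) * kl


if __name__ == "__main__":
    # A chunk where both models agree is kept; a strongly disagreeing one is dropped.
    print(keep_chunk("我想要 order a pizza", "我想要 order a pizza"))
    print(keep_chunk("我想要 order a pizza", "thank you for watching"))
    # Loss for one token position over a toy three-token vocabulary.
    print(kd_loss([1.0, 2.0, 0.5], [0.8, 2.5, 0.1], pseudo_label=1))
```

In a real training setup the loss would be computed over batched tensors with a deep-learning framework, and the filter would operate on the timestamped chunks produced by the teacher; the sketch only shows how the two signals (agreement with the auxiliary model, and hard plus soft teacher targets) could fit together.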