BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation

Raphaël Bagat, Irina Illina, Emmanuel Vincent

TL;DR

BEARD presents a novel approach for adapting a pre-trained encoder-decoder ASR model to a new domain using unlabeled data. It combines a BEST-RQ self-supervised objective applied to a middle encoder layer with two distillation losses from a frozen teacher to preserve decoder complementarity, followed by fine-tuning on limited labeled data. On the ATCO2 Air Traffic Control corpus, BEARD achieves a 12% relative WER improvement over standard fine-tuning and demonstrates robustness across noise levels, supported by ablation analyses. This work shows that self-supervised encoder adaptation can effectively leverage unlabeled data to enhance domain-specific ASR without changing the decoder architecture.
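For readers unfamiliar with BEST-RQ, the following is a minimal sketch of how a random-projection quantizer can produce discrete training targets: a frozen random matrix projects each feature frame, and the index of the nearest entry in a frozen random codebook becomes the label the encoder is trained to predict at masked positions. The dimensions, seeds, and function names are illustrative assumptions, not values or code from the paper.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def make_quantizer(feat_dim, code_dim, num_codes, seed=0):
    """Draw a random projection and codebook once; both stay frozen afterwards."""
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(feat_dim)]
                  for _ in range(code_dim)]
    codebook = [l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
                for _ in range(num_codes)]
    return projection, codebook


def quantize(frame, projection, codebook):
    """Return the index of the codebook entry nearest to the projected frame."""
    projected = l2_normalize([sum(w * x for w, x in zip(row, frame))
                              for row in projection])
    distances = [sum((p - c) ** 2 for p, c in zip(projected, code))
                 for code in codebook]
    return min(range(len(codebook)), key=distances.__getitem__)


# Hypothetical sizes: 80-dim frames, 16-dim codes, 32 codebook entries.
projection, codebook = make_quantizer(feat_dim=80, code_dim=16, num_codes=32)
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(80)] for _ in range(5)]
targets = [quantize(f, projection, codebook) for f in frames]
```

Because neither the projection nor the codebook is trained, the targets are fixed for a given input, which is what allows them to serve as labels without any transcription.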

Abstract

Automatic Speech Recognition (ASR) systems, despite large-scale multilingual training, struggle in out-of-domain and low-resource scenarios where labeled data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper's encoder using unlabeled data. Unlike traditional self-supervised learning methods, BEARD combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, ensuring the encoder's complementarity with the pre-trained decoder. Our experiments focus on the ATCO2 corpus from the challenging Air Traffic Control (ATC) communications domain, characterized by non-native speech, noise, and specialized phraseology. Using about 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, the proposed approach significantly outperforms both the previous baseline and the fine-tuned model, achieving a 12% relative improvement over the fine-tuned model. To the best of our knowledge, this is the first work to use a self-supervised learning objective for domain adaptation of Whisper.
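As a reminder of what a relative improvement means here, the sketch below computes a relative WER reduction. The WER values in the example are hypothetical placeholders chosen only so the arithmetic lands on 12%; they are not results reported in the paper.

```python
def relative_wer_improvement(wer_reference, wer_proposed):
    """Relative WER reduction of the proposed system with respect to a reference."""
    if wer_reference <= 0:
        raise ValueError("reference WER must be positive")
    return (wer_reference - wer_proposed) / wer_reference


# Hypothetical WERs (in percent), not figures from the paper.
improvement = relative_wer_improvement(wer_reference=25.0, wer_proposed=22.0)
print(f"{improvement:.0%}")  # prints "12%"
```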

Paper Structure

This paper contains 12 sections, 1 equation, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Architecture of the proposed BEARD framework. On the left, the BEST-RQ objective ($\mathcal{L}_q^\ell$) is applied to the output of the $\ell$-th Transformer layer. On the right, two distillation losses, $\mathcal{L}_d^\ell$ and $\mathcal{L}_d^n$, are computed against a frozen teacher encoder at the $\ell$-th layer and at the output layer, respectively.
  • Figure 2: Comparison of WER across SNR bins for our best BEARD configuration ($\ell=6$, $\lambda=0.5$) and standard fine-tuning (FT). SNR was estimated using WADA-SNR (Kim and Stern, 2008).
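
The caption of Figure 1 names three loss terms. One plausible way to combine them is sketched below, with a weight $\lambda$ scaling the two distillation terms relative to the BEST-RQ term. The weighting scheme, the role given to $\lambda$, and the choice of mean squared error for the distillation terms are all assumptions made for illustration; the paper's own equation defines the actual objective. Cross-entropy over codebook indices is the standard BEST-RQ prediction loss.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target index under a softmax over logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def mean_squared_error(student, teacher):
    """Average squared difference between student and teacher features."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def combined_loss(loss_q, loss_d_mid, loss_d_out, lam):
    """Assumed weighting: BEST-RQ term plus lam times the summed distillation terms."""
    return loss_q + lam * (loss_d_mid + loss_d_out)


# Toy values for a single masked frame; all numbers are made up.
loss_q = cross_entropy(logits=[2.0, 0.5, -1.0], target=0)      # L_q at layer ell
loss_d_mid = mean_squared_error([0.2, 0.4], [0.0, 0.5])        # L_d at layer ell
loss_d_out = mean_squared_error([1.0, 1.0], [1.0, 1.0])        # L_d at output layer
total = combined_loss(loss_q, loss_d_mid, loss_d_out, lam=0.5)
```

In this sketch the teacher features would come from a frozen copy of the original encoder, so the distillation terms penalize the adapted encoder for drifting away from representations the pre-trained decoder expects.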