Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

Yuchen Hu; Chen Chen; Chao-Han Huck Yang; Chengwei Qin; Pin-Yu Chen; Eng Siong Chng; Chao Zhang

TL;DR

STAR addresses unsupervised domain adaptation for ASR by leveraging unlabeled target-domain data to adapt speech foundation models without accessing source data. It introduces a token-level quality indicator derived from decoding attention and a reweighted training objective, complemented by utterance-level filtering to guide informed finetuning. Across 14 target domains, STAR achieves an average relative WER reduction of $13.5\%$, sometimes approaching supervised upper bounds, and demonstrates data efficiency with under $1$ hour of unlabeled data, as well as applicability to other models and speech translation tasks. Importantly, STAR mitigates catastrophic forgetting and offers a practical, generalizable framework for rapid deployment in real-world scenarios.
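To make the headline metric concrete, the following is a minimal sketch of how word error rate (WER) and a relative WER reduction are conventionally computed. The function names and the example numbers are illustrative assumptions, not taken from the paper or its code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline WER removed by adaptation."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical numbers: a baseline WER of 10.0% dropping to 8.65%
    # corresponds to a 13.5% relative reduction.
    print(round(relative_wer_reduction(10.0, 8.65), 3))
```

Note that a relative reduction is measured against the baseline WER, so the same relative figure corresponds to different absolute gains on easy and hard domains.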

Abstract

We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noisy and accented speech. STAR is developed for prevalent speech foundation models based on Transformer-related architectures with auto-regressive decoding (e.g., Whisper, Canary). Specifically, we propose a novel indicator that empirically integrates step-wise information during decoding to assess the token-level quality of pseudo labels without ground truth, thereby guiding model updates for effective unsupervised adaptation. Experimental results show that STAR achieves an average 13.5% relative reduction in word error rate across 14 target domains, and it sometimes even approaches the upper-bound performance of supervised adaptation. Surprisingly, we also observe that STAR prevents the adapted model from suffering the common catastrophic forgetting problem, without recalling source-domain data. Furthermore, STAR exhibits high data efficiency, requiring less than one hour of unlabeled data, and generalizes seamlessly to alternative large speech models and speech translation tasks. We aim to open-source our code to the research community.
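The idea of letting a token-level quality score guide model updates can be sketched as a score-weighted negative log-likelihood over a pseudo-labeled utterance. This is only an illustrative assumption about the general shape of such an objective: the function name, the per-token scores, and the normalization are hypothetical, and the actual STAR indicator and loss are defined in the paper's methodology section.

```python
import math
from typing import Sequence


def weighted_pseudo_label_loss(
    token_log_probs: Sequence[float],
    token_scores: Sequence[float],
) -> float:
    """Score-weighted negative log-likelihood for one pseudo-labeled utterance.

    token_log_probs: model log-probability of each pseudo-label token.
    token_scores:    hypothetical per-token quality scores (higher = more trusted).
    """
    if len(token_log_probs) != len(token_scores):
        raise ValueError("one score is required per token")
    if not token_log_probs:
        raise ValueError("utterance must contain at least one token")
    weighted = sum(s * lp for s, lp in zip(token_scores, token_log_probs))
    # Average over tokens so utterances of different lengths are comparable.
    return -weighted / len(token_log_probs)


if __name__ == "__main__":
    log_probs = [math.log(0.9), math.log(0.5), math.log(0.2)]
    # With uniform scores of 1.0 this reduces to the usual mean NLL.
    print(weighted_pseudo_label_loss(log_probs, [1.0, 1.0, 1.0]))
    # Down-weighting the least trusted token reduces its pull on the update.
    print(weighted_pseudo_label_loss(log_probs, [1.0, 1.0, 0.1]))
```

In this sketch, a token with a score near zero contributes almost nothing to the gradient, which is the intuition behind down-weighting pseudo-label tokens that are likely to be wrong.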

Paper Structure (27 sections, 9 equations, 10 figures, 12 tables, 1 algorithm)

Introduction
Related Work
Methodology
Problem Setup
Token-level Assessment and Re-weighting
Utterance-level Filtering
Experimental Setup
ASR Domains
Configurations
Results and Analysis
Effectiveness of STAR
Generality of STAR
Ablation Study
Conclusion
Additional Discussions on the Design of the STAR Framework
...and 12 more sections

Figures (10)

Figure 1: Illustration of unsupervised domain adaptation (UDA) and source-free UDA frameworks. (i) UDA problem. (ii) Source-free UDA by self-training. STAR works by selecting high-quality pseudo labels and guiding the ASR foundation model's adaptation at the token level.
Figure 2: (Left): An example of a pseudo label, ground-truth transcription, confidence scores, attention matrix and attentive scores. (Right-Up): Confusion matrix of confidence and attentive scores, where the y-axis denotes whether the pseudo token is correct or wrong, and the x-axis denotes whether the corresponding score is high or low (with 1 as the threshold; more analysis is in Fig. \ref{f7}), so that the diagonal values indicate the score's reliability in assessing the quality of the pseudo label. (Right-Down): Variance of the two scores for correct and wrong pseudo tokens.
Figure 3: WER (%) results with different numbers of unlabeled training samples. The minimum amount of data (in hours) required to obtain the best performance is highlighted with a star mark.
Figure 4: WER (%) results of STAR with different speech foundation models on CHiME-4 test-real. More models / datasets are evaluated in Tables \ref{table:main_results_model_size} and \ref{table:main_results_seamless}.
Figure 5: Spectrograms of parallel clean and noisy speech samples, where we select two noise types for visualization, i.e., airport station and babble (used in our experiments). The speech samples are selected from the LS-FreeSound test set, and the sample ID is "1089-134686-0003".
...and 5 more figures
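The confusion matrix described in the Figure 2 caption can be sketched as follows: each pseudo token is binned by whether it is correct and by whether its score is high or low relative to a threshold of 1. The function names, the use of `>=` at the threshold, and the example values are assumptions for illustration, not the paper's implementation.

```python
from typing import Dict, Sequence, Tuple

Cell = Tuple[str, str]  # (row, column) = ("correct" | "wrong", "high" | "low")


def score_confusion_matrix(
    scores: Sequence[float],
    is_correct: Sequence[bool],
    threshold: float = 1.0,
) -> Dict[Cell, int]:
    """Count tokens by correctness (rows) and high/low score (columns)."""
    if len(scores) != len(is_correct):
        raise ValueError("scores and is_correct must have the same length")
    counts: Dict[Cell, int] = {
        ("correct", "high"): 0,
        ("correct", "low"): 0,
        ("wrong", "high"): 0,
        ("wrong", "low"): 0,
    }
    for score, ok in zip(scores, is_correct):
        row = "correct" if ok else "wrong"
        col = "high" if score >= threshold else "low"  # assumed tie-breaking
        counts[(row, col)] += 1
    return counts


def diagonal_reliability(counts: Dict[Cell, int]) -> float:
    """Share of tokens on the diagonal: correct/high plus wrong/low."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return (counts[("correct", "high")] + counts[("wrong", "low")]) / total


if __name__ == "__main__":
    # Hypothetical scores and correctness flags for four pseudo tokens.
    counts = score_confusion_matrix([1.5, 0.4, 1.2, 0.8], [True, False, False, True])
    print(counts, diagonal_reliability(counts))
```

A score is a more reliable quality indicator the more mass it places on the diagonal, that is, high scores on correct tokens and low scores on wrong ones.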
