Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition
Yi-Cheng Lin, Yu-Hsuan Li Liang, Hsuan Su, Tzu-Quan Lin, Shang-Tse Chen, Yun-Nung Chen, Hung-yi Lee
TL;DR
This work introduces Pseudo2Real, a parameter-space correction that mitigates the structured biases pseudo-labeling introduces in ASR domain adaptation, without requiring ground-truth labels in the target domain. A correction vector τ is computed in a source domain as the difference between a model fine-tuned on real labels and one fine-tuned on pseudo-labels, and is then applied to the target model as θ_t^corrected = θ_t^pseudo + λ·τ. The method improves WER across ten AfriSpeech-200 accents and multiple Whisper scales, with up to a 35% relative reduction for Whisper tiny. The paper also proposes Pseudo2Real-SC, which uses speaker clustering to produce subgroup-specific correction vectors and aggregates them, improving robustness in several teacher–student configurations. Experiments analyze the impact of the scaling factor λ and show that finer clustering (up to k=8) yields additional gains at a reasonable compute cost. The work offers practical improvements for low-resource accents and a foundation for future multilingual and interpretability extensions, while acknowledging limitations in source supervision and language scope, along with potential ethical considerations.
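The correction rule above can be illustrated with a minimal sketch. It assumes model weights are stored as a dictionary of flattened parameter lists. The function names (`correction_vector`, `apply_correction`, `mean_vector`) and the toy values are hypothetical and not taken from the paper. In particular, the summary does not say how Pseudo2Real-SC aggregates its subgroup vectors, so the uniform average in `mean_vector` is only one plausible choice, not necessarily the authors' method.

```python
from typing import Dict, List

# Assumption: a model is represented as {parameter name: flattened weights}.
Params = Dict[str, List[float]]


def correction_vector(theta_real: Params, theta_pseudo: Params) -> Params:
    """tau = theta_s^real - theta_s^pseudo, computed per parameter."""
    return {
        name: [r - p for r, p in zip(theta_real[name], theta_pseudo[name])]
        for name in theta_real
    }


def apply_correction(theta_target_pseudo: Params, tau: Params, lam: float) -> Params:
    """theta_t^corrected = theta_t^pseudo + lam * tau."""
    return {
        name: [w + lam * t for w, t in zip(theta_target_pseudo[name], tau[name])]
        for name in theta_target_pseudo
    }


def mean_vector(taus: List[Params]) -> Params:
    """Uniform average of subgroup vectors (an assumed aggregation rule)."""
    n = len(taus)
    return {
        name: [sum(vals) / n for vals in zip(*(t[name] for t in taus))]
        for name in taus[0]
    }


# Toy source-domain models fine-tuned from the same initialization.
theta_s_real = {"w": [1.0, 2.0]}
theta_s_pseudo = {"w": [0.5, 2.5]}
tau = correction_vector(theta_s_real, theta_s_pseudo)

# Toy target-domain model trained on pseudo-labels only.
theta_t_pseudo = {"w": [0.0, 1.0]}
theta_t_corrected = apply_correction(theta_t_pseudo, tau, lam=1.0)
```

Setting `lam=0.0` returns the uncorrected pseudo-label model, so in this sketch λ directly controls how strongly the source-domain correction is transferred to the target.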
Abstract
Robust ASR under domain shift is crucial because real-world systems encounter unseen accents and domains with limited labeled data. Although pseudo-labeling offers a practical workaround, it often introduces systematic, accent-specific errors that filtering fails to fix. We ask: How can we correct these recurring biases without target ground truth? We propose a simple parameter-space correction: in a source domain containing both real and pseudo-labeled data, two ASR models are fine-tuned from the same initialization, one on ground-truth labels and the other on pseudo-labels, and their weight difference forms a correction vector that captures pseudo-label biases. When applied to a pseudo-labeled target model, this vector enhances recognition, achieving up to a 35% relative Word Error Rate (WER) reduction on AfriSpeech-200 across ten African accents with the Whisper tiny model.
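To make the "relative WER reduction" figure concrete, the short sketch below shows how such a number is computed. The helper name `relative_wer_reduction` and the WER values are hypothetical, chosen only to illustrate the arithmetic. They are not results from the paper.

```python
def relative_wer_reduction(wer_baseline: float, wer_corrected: float) -> float:
    """Fraction of the baseline WER removed by the correction."""
    if wer_baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return (wer_baseline - wer_corrected) / wer_baseline


# Hypothetical numbers: a drop from 40.0% to 26.0% WER is a 14-point
# absolute reduction, which is a 35% relative reduction.
reduction = relative_wer_reduction(40.0, 26.0)
```

A relative reduction is measured against the baseline error, so the same relative figure corresponds to a smaller absolute gain when the baseline WER is already low.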
