Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition

Yi-Cheng Lin, Yu-Hsuan Li Liang, Hsuan Su, Tzu-Quan Lin, Shang-Tse Chen, Yun-Nung Chen, Hung-yi Lee

TL;DR

This work introduces Pseudo2Real, a parameter-space correction that mitigates structured biases introduced by pseudo-labeling in ASR domain adaptation without target ground-truth data. By computing a correction vector τ as the difference between real-label and pseudo-label fine-tuned models in a source domain, and applying θ_t^corrected = θ_t^pseudo + λ·τ to the target, the method achieves substantial WER improvements across ten AfriSpeech-200 accents and multiple Whisper scales, notably up to 35% relative reduction for Whisper tiny. The paper further extends the method to Pseudo2Real-SC, which uses speaker clustering to produce subgroup-specific correction vectors and aggregates them, improving robustness in several teacher–student configurations. Extensive experiments demonstrate the method’s effectiveness, analyze the impact of the scaling factor λ, and show that finer clustering (up to k=8) yields additional gains at a reasonable compute cost. The work highlights practical improvements for low-resource accents and provides a foundation for future multilingual and interpretability extensions, while acknowledging limitations in source supervision, language scope, and potential ethical considerations.

Abstract

Robust ASR under domain shift is crucial because real-world systems encounter unseen accents and domains with limited labeled data. Although pseudo-labeling offers a practical workaround, it often introduces systematic, accent-specific errors that filtering fails to fix. We ask: How can we correct these recurring biases without target ground truth? We propose a simple parameter-space correction: in a source domain containing both real and pseudo-labeled data, two ASR models are fine-tuned from the same initialization, one on ground-truth labels and the other on pseudo-labels, and their weight difference forms a correction vector that captures pseudo-label biases. When applied to a pseudo-labeled target model, this vector enhances recognition, achieving up to a 35% relative Word Error Rate (WER) reduction on AfriSpeech-200 across ten African accents with the Whisper tiny model.
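The parameter-space correction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes τ is taken as real-label weights minus pseudo-label weights (the sign under which adding τ moves a model toward real-label behavior), and it uses plain dictionaries of floats as a toy stand-in for model checkpoints. The names `correction_vector`, `apply_correction`, and `lam` are hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def correction_vector(theta_s_real: Params, theta_s_pseudo: Params) -> Params:
    """Assumed sign: tau = theta_s_real - theta_s_pseudo, per parameter."""
    return {
        name: [r - p for r, p in zip(theta_s_real[name], theta_s_pseudo[name])]
        for name in theta_s_real
    }


def apply_correction(theta_t_pseudo: Params, tau: Params, lam: float) -> Params:
    """theta_t_corrected = theta_t_pseudo + lam * tau, per parameter."""
    return {
        name: [w + lam * d for w, d in zip(theta_t_pseudo[name], tau[name])]
        for name in theta_t_pseudo
    }


# Toy values only; real checkpoints would hold Whisper weights.
theta_s_real = {"w": [1.0, 2.0], "b": [0.5]}      # source, fine-tuned on real labels
theta_s_pseudo = {"w": [0.5, 2.5], "b": [0.25]}   # source, fine-tuned on pseudo-labels
theta_t_pseudo = {"w": [2.0, 2.0], "b": [1.0]}    # target, fine-tuned on pseudo-labels

tau = correction_vector(theta_s_real, theta_s_pseudo)
theta_t_corrected = apply_correction(theta_t_pseudo, tau, lam=0.5)
```

Both source models must share the same initialization and architecture as the target model so that the parameters line up one to one; the scaling factor λ (here `lam`) controls how strongly the correction is applied.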

Paper Structure

This paper contains 41 sections, 4 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Overview of Pseudo2Real. a) In the source domain, two ASR models are fine-tuned from the same pretrained initialization: one using ground-truth transcripts and one using pseudo-labels. Their parameter difference defines a correction vector that captures systematic pseudo-labeling biases. b) In a new target domain, this correction vector is added to a pseudo-label fine-tuned model to produce a corrected ASR that better aligns with real-label performance. Color semantics: green = source-domain (ground-truth) knowledge, orange = pseudo-label noise, and purple = target-domain knowledge.
  • Figure 2: Learning and applying correction vectors in parameter space. a) A task vector is obtained by taking the difference between a pretrained model $\theta^{\text{pre}}$ and its fine-tuned version $\theta_{s}^{\text{real}}$ (or $\theta_{s}^{\text{pseudo}}$). b) In the source domain, two models are fine-tuned from the same pretrained initialization $\theta^{\text{pre}}$: one with real transcripts ($\theta_{s}^{\text{real}}$) and one with pseudo-labels ($\theta_{s}^{\text{pseudo}}$). Their difference defines the correction vector $\tau$. In a new target domain, we first obtain $\theta_{t}^{\text{pseudo}}$ by fine-tuning on pseudo-labels, then apply the correction vector to yield the final model $\theta_{t}^{\text{corrected}}$.
  • Figure 3: WER vs. scaling factor ($\lambda$). Each curve corresponds to a different teacher–student pairing. Here, the arrow ($\rightarrow$) denotes that pseudo-labels are generated by the teacher ASR model on the left and used to fine-tune the student model on the right (e.g., large$\rightarrow$tiny means pseudo-labels are produced by the large teacher, and the tiny student's parameters are then adjusted using the Pseudo2Real correction vector). Lower WER indicates better performance.
  • Figure 4: WER vs. number of K-means clusters for the large → small setting. Increasing the number of clusters improves adaptation quality (lower WER).
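
The quantities named in the Figure 2 caption can be written out as follows. This is a sketch reconstructed from the caption and the TL;DR rather than a copy of the paper's numbered equations; the sign convention for $\tau$ (real minus pseudo) and the symbols $v^{\text{real}}$ and $v^{\text{pseudo}}$ for the two task vectors are assumptions made here for illustration.

```latex
\begin{align}
  v^{\text{real}}   &= \theta_{s}^{\text{real}}   - \theta^{\text{pre}}, &
  v^{\text{pseudo}} &= \theta_{s}^{\text{pseudo}} - \theta^{\text{pre}}, \\
  \tau &= \theta_{s}^{\text{real}} - \theta_{s}^{\text{pseudo}}
        = v^{\text{real}} - v^{\text{pseudo}}, \\
  \theta_{t}^{\text{corrected}} &= \theta_{t}^{\text{pseudo}} + \lambda \, \tau .
\end{align}
```

Because both source models start from the same $\theta^{\text{pre}}$, the pretrained weights cancel, so the correction vector equals the difference of the two task vectors. Setting $\lambda = 0$ recovers the uncorrected pseudo-label model $\theta_{t}^{\text{pseudo}}$.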