Table of Contents
Fetching ...

Two-Stage Adaptation for Non-Normative Speech Recognition: Revisiting Speaker-Independent Initialization for Personalization

Shan Jiang, Jiawen Qi, Chuanbing Huo, Yingqiang Gao, Qinyu Chen

Abstract

Personalizing automatic speech recognition (ASR) systems for non-normative speech, such as dysarthric and aphasic speech, is challenging. While speaker-specific fine-tuning (SS-FT) is widely used, it is typically initialized directly from a generic pre-trained model. Whether speaker-independent adaptation provides a stronger initialization prior under such mismatch remains unclear. In this work, we propose a two-stage adaptation framework consisting of speaker-independent fine-tuning (SI-FT) on multi-speaker non-normative data followed by SS-FT, and evaluate it through a controlled comparison with direct SS-FT under identical per-speaker conditions. Experiments on AphasiaBank and UA-Speech with Whisper-Large-v3 and Qwen3-ASR, alongside evaluation on typical-speech datasets TED-LIUM v3 and FLEURS, show that two-stage adaptation consistently improves personalization while maintaining manageable out-of-domain (OOD) trade-offs.

Two-Stage Adaptation for Non-Normative Speech Recognition: Revisiting Speaker-Independent Initialization for Personalization

Abstract

Personalizing automatic speech recognition (ASR) systems for non-normative speech, such as dysarthric and aphasic speech, is challenging. While speaker-specific fine-tuning (SS-FT) is widely used, it is typically initialized directly from a generic pre-trained model. Whether speaker-independent adaptation provides a stronger initialization prior under such mismatch remains unclear. In this work, we propose a two-stage adaptation framework consisting of speaker-independent fine-tuning (SI-FT) on multi-speaker non-normative data followed by SS-FT, and evaluate it through a controlled comparison with direct SS-FT under identical per-speaker conditions. Experiments on AphasiaBank and UA-Speech with Whisper-Large-v3 and Qwen3-ASR, alongside evaluation on typical-speech datasets TED-LIUM v3 and FLEURS, show that two-stage adaptation consistently improves personalization while maintaining manageable out-of-domain (OOD) trade-offs.
Paper Structure (14 sections, 2 figures, 1 table)

This paper contains 14 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Overview of our methodology. Left: transcript normalization and filtering for AphasiaBank. Middle: speaker-level partition into speaker-independent (SI) and speaker-specific (SS) groups (top 10% speakers selected for personalization), with per-speaker train/validation/test splits. Right: comparison of four experimental conditions (B1–B4), isolating the effect of speaker-independent initialization for speaker-specific adaptation.
  • Figure 2: Per-speaker $\Delta$WER (B3-B4) and $\Delta$CER (B3-B4) on AphasiaBank for two ASR systems.