Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation

Mengzhe Geng, Xurong Xie, Jiajun Deng, Zengrui Jin, Guinan Li, Tianzi Wang, Shujie Hu, Zhaoqing Li, Helen Meng, Xunying Liu

TL;DR

This work addresses the mismatch, data scarcity, and speaker variability challenges in dysarthric and elderly ASR by introducing two on-the-fly, data-efficient adaptation methods. It combines variance-regularized spectral basis embedding (VR-SBE) features with a regression-based, on-the-fly LHUC transform (f-LHUC) driven by VR-SBE to rapidly personalize TDNN/Conformer models at test time. Across four English and Cantonese corpora, the proposed methods yield statistically significant WER/CER reductions of up to 5.32% absolute (18.57% relative) over iVector/xVector baselines and up to 2.24% absolute (9.20% relative) over batch-mode (offline) LHUC, while running up to 33.6 times faster than xVector adaptation in terms of real-time factor. Both methods are insensitive to the amount of speaker-level data available at test time. t-SNE analyses indicate stronger speaker-level homogeneity than iVectors, xVectors and batch-mode LHUC transforms, and the best system reaches a state-of-the-art WER of 23.33% on UASpeech, indicating practical impact for real-time assistive ASR for dysarthric and elderly speakers.

Abstract

The application of data-intensive automatic speech recognition (ASR) technologies to dysarthric and elderly adult speech is confronted by their mismatch against healthy and non-aged voices, data scarcity and large speaker-level variability. To this end, this paper proposes two novel data-efficient methods to learn homogeneous dysarthric and elderly speaker-level features for rapid, on-the-fly test-time adaptation of DNN/TDNN and Conformer ASR models. These include: 1) speaker-level variance-regularized spectral basis embedding (VR-SBE) features that exploit a special regularization term to enforce homogeneity of speaker features in adaptation; and 2) feature-based learning hidden unit contributions (f-LHUC) transforms that are conditioned on VR-SBE features. Experiments are conducted on four tasks across two languages: the English UASpeech and TORGO dysarthric speech datasets, and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora. The proposed on-the-fly speaker adaptation techniques consistently outperform baseline iVector and xVector adaptation by statistically significant word or character error rate reductions of up to 5.32% absolute (18.57% relative) and batch-mode LHUC speaker adaptation by 2.24% absolute (9.20% relative), while achieving real-time factor speed-ups of up to 33.6 times over xVectors during adaptation. The efficacy of the proposed adaptation techniques is demonstrated in a comparison against current ASR technologies including SSL pre-trained systems on UASpeech, where our best system produces a state-of-the-art WER of 23.33%. Analyses show VR-SBE features and f-LHUC transforms are insensitive to speaker-level data quantity in test-time adaptation. t-SNE visualization reveals they have stronger speaker-level homogeneity than baseline iVectors, xVectors and batch-mode LHUC transforms.
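To make the idea of an LHUC transform predicted from speaker features more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the `2 * sigmoid(r)` amplitude is the parameterization commonly used in the LHUC literature and is assumed here, and the tiny two-layer regressor, its random weights (`W_in`, `W_out`), the dimensions and the `RunningMean` helper are all hypothetical stand-ins. The running mean only illustrates the general idea of averaging hidden context across the utterances seen so far.

```python
import numpy as np


def lhuc_scale(hidden: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Scale hidden activations element-wise by 2*sigmoid(r), which lies in (0, 2).

    Assumption: this amplitude function is the common LHUC choice, not
    something stated in the summary above.
    """
    amplitude = 2.0 / (1.0 + np.exp(-r))
    return hidden * amplitude


class RunningMean:
    """Online mean over the vectors seen so far (one vector per utterance)."""

    def __init__(self, dim: int) -> None:
        self.mean = np.zeros(dim)
        self.count = 0

    def update(self, x: np.ndarray) -> np.ndarray:
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean.copy()


rng = np.random.default_rng(0)
feat_dim, ctx_dim, hidden_dim = 8, 6, 4

# Hypothetical regression weights; a real system would learn these.
W_in = 0.1 * rng.standard_normal((ctx_dim, feat_dim))
W_out = 0.1 * rng.standard_normal((hidden_dim, ctx_dim))

context = RunningMean(ctx_dim)
hidden = np.ones(hidden_dim)  # stand-in for one acoustic-model hidden layer output

for _ in range(3):  # three utterances from the same speaker
    utt_feature = rng.standard_normal(feat_dim)  # stand-in for a speaker feature
    utt_context = np.tanh(W_in @ utt_feature)    # per-utterance hidden context
    speaker_context = context.update(utt_context)  # average over utterances so far
    r = W_out @ speaker_context                  # predicted LHUC parameters
    adapted = lhuc_scale(hidden, r)              # adapted hidden activations
```

With `r = 0` the scaling factor is exactly 1, so the adapted layer falls back to the unadapted one. Because the prediction needs only a forward pass, no per-speaker gradient updates are required at test time in this sketch.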

Paper Structure

This paper contains 32 sections, 5 equations, 6 figures, 17 tables.

Figures (6)

  • Figure 1: Example extraction of variance-regularized spectral basis embedding (VR-SBE) features (bottom right) for on-the-fly speaker adaptation over 3 phases: 1) utterance-level SVD spectrum decomposition (left, in light yellow) with $C = 40$ mel-filterbank channels and retaining top $d = 2$ principal spectral bases; 2) multitask learning to derive utterance-level spectral basis embeddings using speaker ("spkr.") IDs and speech intelligibility ("intel.") or age groups; and 3) extraction of VR-SBE features with averaged speaker embeddings from phase-2 for computing the additional MSE cost.
  • Figure 2: Incorporating our on-the-fly VR-SBE adaptation at the front-ends of (a) hybrid TDNN and (b) E2E Conformer. In (a), path (i) leads to additional model-based LHUC adaptation (top) while path (ii) incorporates our on-the-fly f-LHUC adaptation (bottom). In (b), paths (i) and (ii) respectively apply LHUC adaptation after convolution pooling and a particular encoder block.
  • Figure 3: Example f-LHUC regression network using FBK + VR-SBE input features with an online cross-utterance hidden context averaging layer (top right in light blue) to predict homogeneous speaker-level LHUC transforms on the fly.
  • Figure 4: Overall procedure of building VR-SBE feature driven f-LHUC SAT DNN/TDNN systems over three phases: 1) f-LHUC regression network training (in blue); 2) acoustic model (AM) fine-tuning with f-LHUC transforms (in orange); and 3) on-the-fly speaker adaptation with f-LHUC transforms (in red). Here "AM" stands for DNN/TDNN acoustic model, while "Reg. Network" denotes the f-LHUC regression network. "$\ast$" indicates the model or network is frozen during inference.
  • Figure 5: t-SNE visualization illustrating speaker feature homogeneity measured by covariance determinants after applying t-SNE projection: (a)-(c) on-the-fly VR-SBE features vs. iVectors and xVectors; (d)-(f) VR-SBE vs. batch-mode SBE features and LHUC transforms; and (g)-(i) VR-SBE feature driven f-LHUC vs. batch-mode SBE features and LHUC transforms, obtained from speakers F02, F03 and M11 of UASpeech.
  • ...and 1 more figure
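
The first phase described in the Figure 1 caption, utterance-level SVD of the spectrum with $C = 40$ mel-filterbank channels and the top $d = 2$ spectral bases retained, can be sketched as follows. This is an illustrative sketch only: the random matrix stands in for real filterbank features, the function name `top_spectral_bases` and the frame count `T` are hypothetical, and flattening the bases into a single vector is an assumption about how they might be fed to a downstream embedding network.

```python
import numpy as np


def top_spectral_bases(spectrum: np.ndarray, d: int = 2) -> np.ndarray:
    """Return the top-d left singular vectors of a (C, T) spectrum.

    Each column is a C-dimensional spectral basis; columns are ordered by
    decreasing singular value, as guaranteed by numpy.linalg.svd.
    """
    if spectrum.ndim != 2:
        raise ValueError("expected a 2-D (channels, frames) array")
    U, _singular_values, _Vt = np.linalg.svd(spectrum, full_matrices=False)
    return U[:, :d]


rng = np.random.default_rng(0)
C, T = 40, 300                            # 40 channels; T frames is an arbitrary choice
utterance = rng.standard_normal((C, T))   # stand-in for a mel-filterbank spectrum

bases = top_spectral_bases(utterance, d=2)  # shape (40, 2)
spectral_input = bases.reshape(-1)          # flattened to 80 values (assumed layout)
```

Singular vectors are only defined up to sign, so a practical pipeline would likely need a sign convention before using them as network inputs; that detail is not covered in this excerpt.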