Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition

Huimeng Wang, Xurong Xie, Mengzhe Geng, Shujie Hu, Haoning Xu, Youjun Chen, Zhaoqing Li, Jiajun Deng, Xunying Liu

TL;DR

The paper tackles dysarthric speech recognition with phone-purity guided (PPG) discrete tokens derived from HuBERT representations. Regularizing K-means and VAE-VQ token extraction with frame-level phonetic labels gives sharper clusters and better phonetic discrimination, which translates into consistent WER reductions on the UASpeech dataset. Compared with hybrid TDNN and Conformer systems that use non-PPG tokens, PPG tokens lower WER by up to 0.99% and 1.77% absolute, respectively, and combining systems built on different token features reaches the lowest WER of 23.25%. Gains on the phone purity metric and t-SNE visualizations support these results. The work shows that supervised guidance during discrete token extraction can meaningfully improve dysarthric ASR, and it points toward more robust, domain-adaptive token-based speech representations.

Abstract

Discrete tokens provide efficient and domain-adaptable speech features. Their application to disordered speech, which exhibits articulation imprecision and a large mismatch against normal voice, remains unexplored. To improve their phonetic discrimination, which is weakened during unsupervised K-means or vector quantization of continuous features, this paper proposes novel phone-purity guided (PPG) discrete tokens for dysarthric speech recognition. Phonetic label supervision is used to regularize the maximum likelihood and reconstruction error costs used in standard K-means and VAE-VQ based discrete token extraction. Experiments conducted on the UASpeech corpus suggest that the proposed PPG discrete token features extracted from HuBERT consistently outperform hybrid TDNN and End-to-End (E2E) Conformer systems using non-PPG based K-means or VAE-VQ tokens across varying codebook sizes, with statistically significant word error rate (WER) reductions of up to 0.99% and 1.77% absolute (3.21% and 4.82% relative), respectively, on the UASpeech test set of 16 dysarthric speakers. The lowest WER of 23.25% was obtained by combining systems using different token features. Consistent improvements on the phone purity metric were also achieved. t-SNE visualization further shows that sharper decision boundaries were produced between K-means/VAE-VQ clusters after introducing phone-purity guidance.
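The abstract reports gains on a phone purity metric without spelling out how it is computed. The sketch below assumes the definition commonly used for discrete speech units: each token is assigned its most frequent phone, and purity is the fraction of frames whose phone label matches that majority phone. The function name `phone_purity` and the toy frame labels are hypothetical, and the paper's exact formulation may differ.

```python
from collections import Counter, defaultdict


def phone_purity(tokens, phones):
    """Fraction of frames whose phone equals the majority phone of their token.

    Assumed definition (not taken from the paper): sum over tokens of the
    count of the most frequent phone in that token, divided by total frames.
    """
    if len(tokens) != len(phones):
        raise ValueError("tokens and phones must be frame-aligned")
    if not tokens:
        raise ValueError("at least one frame is required")

    # Count how often each phone co-occurs with each token.
    per_token = defaultdict(Counter)
    for token, phone in zip(tokens, phones):
        per_token[token][phone] += 1

    # Frames that agree with the majority phone of their token.
    majority_frames = sum(max(counts.values()) for counts in per_token.values())
    return majority_frames / len(tokens)


# Toy frame-level alignment: token 0 mixes two phones, tokens 1 and 2 are pure.
tokens = [0, 0, 0, 1, 1, 2, 2, 2]
phones = ["aa", "aa", "iy", "iy", "iy", "s", "s", "s"]
purity = phone_purity(tokens, phones)
```

Under this definition a codebook in which every token maps to a single phone scores 1.0, so higher values indicate tokens that are more phonetically consistent.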

Paper Structure

This paper contains 12 sections, 6 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Illustration of discrete token extraction from fine-tuned HuBERT models. The 256-dim compact continuous speech representations are quantized into discrete tokens via either K-means or VAE-VQ without(-)/with(+) phone purity loss regularization.
  • Figure 2: t-SNE visualization of the 5 largest K-means/VAE-VQ clusters for 4 speakers (XM01, XM07, XM08, and XM11, from left to right), without phone purity guidance (upper row) and with it (lower row).
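To make the quantization step in Figure 1 concrete, here is a minimal sketch of the unguided K-means baseline only: a codebook is fitted to continuous frame vectors, and each frame is then replaced by the index of its nearest centroid. It does not implement the phone purity regularization, whose loss is not given in this summary. The helper names, the strided initialization, and the synthetic 256-dim frames are assumptions made for illustration.

```python
import math
import random


def nearest_token(frame, codebook):
    """Index of the codebook vector closest to `frame` (squared Euclidean)."""
    best, best_dist = 0, math.inf
    for k, centroid in enumerate(codebook):
        dist = sum((x - c) ** 2 for x, c in zip(frame, centroid))
        if dist < best_dist:
            best, best_dist = k, dist
    return best


def fit_kmeans(frames, num_tokens, iters=10):
    """Plain Lloyd's K-means; returns a list of `num_tokens` centroids."""
    # Simplified deterministic init: evenly spaced frames (an assumption;
    # random or k-means++ initialization is more common in practice).
    step = len(frames) // num_tokens
    codebook = [list(frames[k * step]) for k in range(num_tokens)]
    for _ in range(iters):
        assignments = [nearest_token(f, codebook) for f in frames]
        for k in range(num_tokens):
            members = [f for f, a in zip(frames, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                codebook[k] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


# Synthetic stand-in for 256-dim speech representations: two separated blobs.
rng = random.Random(0)
DIM = 256
blob_a = [[rng.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(20)]
blob_b = [[rng.gauss(5.0, 0.1) for _ in range(DIM)] for _ in range(20)]
frames = blob_a + blob_b

codebook = fit_kmeans(frames, num_tokens=2)
tokens = [nearest_token(f, codebook) for f in frames]  # one token ID per frame
```

In this unguided form the clusters depend only on distances between feature vectors, which is why phonetic label supervision has room to sharpen the boundaries between them.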