Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition
Huimeng Wang, Xurong Xie, Mengzhe Geng, Shujie Hu, Haoning Xu, Youjun Chen, Zhaoqing Li, Jiajun Deng, Xunying Liu
TL;DR
The paper addresses dysarthric speech recognition with phone-purity guided (PPG) discrete tokens derived from HuBERT representations. By regularizing K-means and VAE-VQ token extraction with frame-level phonetic labels, the authors obtain sharper cluster boundaries and better phonetic discrimination, yielding consistent WER reductions on the UASpeech dataset. Compared with TDNN and Conformer systems using non-PPG tokens, the approach reduces WER by up to 0.99% and 1.77% absolute, respectively, and combining systems that use different token features gives the lowest WER of 23.25%. The results are supported by gains in the phone purity metric and by t-SNE visualizations. The work suggests that supervised guidance during discrete token extraction can improve dysarthric ASR and offers a path toward more robust, domain-adaptive token-based speech representations.
Abstract
Discrete tokens extracted from continuous speech features provide efficient and domain-adaptable speech representations. Their application to disordered speech, which exhibits imprecise articulation and a large mismatch against normal voice, remains unexplored. To improve their phonetic discrimination, which is weakened during unsupervised K-means or vector quantization of continuous features, this paper proposes novel phone-purity guided (PPG) discrete tokens for dysarthric speech recognition. Phonetic label supervision is used to regularize the maximum likelihood and reconstruction error costs used in standard K-means and VAE-VQ based discrete token extraction. Experiments conducted on the UASpeech corpus suggest that hybrid TDNN and end-to-end (E2E) Conformer systems using the proposed PPG discrete token features extracted from HuBERT consistently outperform comparable systems using non-PPG K-means or VAE-VQ tokens across varying codebook sizes, with statistically significant word error rate (WER) reductions of up to 0.99% and 1.77% absolute (3.21% and 4.82% relative), respectively, on the UASpeech test set of 16 dysarthric speakers. The lowest WER of 23.25% was obtained by combining systems using different token features. Consistent improvements on the phone purity metric were also achieved. t-SNE visualization further shows that sharper decision boundaries are produced between K-means/VAE-VQ clusters after introducing phone-purity guidance.
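
The abstract reports gains on a phone purity metric without defining it here. As a rough illustration only, the sketch below computes one common definition of cluster purity: the fraction of frames whose phone label equals the majority phone of the token they were assigned to. The function name `phone_purity`, the toy token IDs, and the toy phone labels are all assumptions made for this example, and the paper's exact formulation may differ.

```python
from collections import Counter, defaultdict


def phone_purity(token_ids, phone_labels):
    """Fraction of frames whose phone matches the majority phone of their token.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(token_ids) != len(phone_labels):
        raise ValueError("token_ids and phone_labels must have the same length")
    if not token_ids:
        raise ValueError("need at least one frame")

    # Count how often each phone label co-occurs with each discrete token.
    per_token = defaultdict(Counter)
    for tok, ph in zip(token_ids, phone_labels):
        per_token[tok][ph] += 1

    # For each token, keep only the frames carrying its most frequent phone.
    majority_frames = sum(max(counts.values()) for counts in per_token.values())
    return majority_frames / len(token_ids)


# Toy frame-level data (assumed): 8 frames, 3 tokens, 3 phones.
tokens = [0, 0, 0, 1, 1, 2, 2, 2]
phones = ["ah", "ah", "s", "s", "s", "t", "t", "ah"]
purity = phone_purity(tokens, phones)
```

Under this definition a value of 1.0 means every token maps to a single phone, so higher values indicate tokens that are more phonetically consistent.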
