
Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features

Shaoxiang Dang, Tetsuya Matsumoto, Yoshinori Takeuchi, Takashi Tsuboi, Yasuhiro Tanaka, Daisuke Nakatsubo, Satoshi Maesawa, Ryuta Saito, Masahisa Katsuno, Hiroaki Kudo

TL;DR

This work addresses the challenge of objective voice-quality assessment in patients with impaired vocal systems despite limited data by leveraging ASR representations from Whisper and SSL representations from HuBERT, pre-trained on large normal-speech corpora. It proposes a feature-fusion pipeline that combines ASR, SSL, and mel-spectrogram features through adapters, followed by a downstream LSTM-FC module to predict all GRBAS indicators, evaluated on PVQD (English) and STN-DBS (Japanese) datasets. The results show strong correlations ($PCC$) and reduced $MSE$ across Grade and GRBAS tasks, with running speech offering particularly robust improvements, and demonstrate initial applicability to PD patients undergoing STN-DBS. Overall, the approach provides a practical, objective tool for clinical voice quality monitoring and motivates future multi-modal extensions.

Abstract

The potential of deep learning in clinical speech processing is immense, yet limited and imbalanced clinical data remain a major hurdle. This article addresses these challenges by using automatic speech recognition (ASR) and self-supervised learning (SSL) representations pre-trained on extensive datasets of normal speech to estimate the voice quality of patients with impaired vocal systems. Experiments are conducted on the PVQD dataset, an English dataset covering various causes of vocal system damage, and on a Japanese dataset of patients with Parkinson's disease recorded before and after subthalamic nucleus deep brain stimulation (STN-DBS) surgery. The results on PVQD show a strong correlation (PCC > 0.8) and a low error (MSE < 0.5) in predicting the Grade, Breathy, and Asthenic indicators. Progress has also been made in predicting the voice quality of patients in the context of STN-DBS.
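The two metrics quoted in the abstract, the Pearson correlation coefficient (PCC) and the mean squared error (MSE), can be sketched in a few lines. This is a minimal illustration using the standard textbook definitions, not the authors' evaluation code; the function names and the example scores are hypothetical and do not come from PVQD or STN-DBS.

```python
from math import sqrt


def pearson_cc(pred, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    if n != len(target) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    # Covariance and variances (the common 1/n factor cancels in the ratio).
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / sqrt(var_p * var_t)


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Hypothetical Grade scores on a 0-3 scale (illustrative values only).
rated = [0.0, 1.0, 2.0, 3.0]
predicted = [0.5, 1.0, 2.5, 3.0]

print(f"PCC = {pearson_cc(predicted, rated):.3f}")
print(f"MSE = {mse(predicted, rated):.3f}")
```

A higher PCC means the predictions track the perceptual ratings more closely, and a lower MSE means they deviate less from them, which is how the thresholds reported in the abstract (PCC > 0.8, MSE < 0.5) should be read.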

Paper Structure (22 sections, 1 equation, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Schematic diagram of the proposed method.
  • Figure 2: Scatter plots of patient-level Grade prediction on PVQD-S (top row) and PVQD-A (bottom row). The red lines and red shaded areas represent the regression lines and their 95% confidence intervals. The green shaded areas show the region where the error is less than 0.5, corresponding to auditory-perceptual judgment with discrete scores.
  • Figure 3: Confusion matrix of predicting Grade on STN-DBS.
  • Figure 4: Visualization from PVQD (a) and STN-DBS (b).