Reference-free automatic speech severity evaluation using acoustic unit language modelling

Bence Mark Halpern, Tomoki Toda

TL;DR

This paper addresses the need for ecologically valid, reference-free speech severity evaluation by applying SpeechLMScore, a method based on acoustic-unit language modelling over self-supervised speech representations. SpeechLMScore quantizes HuBERT features into discrete units and uses an LSTM trained on LibriLight to model the probability of unit sequences, with perplexity serving as a severity proxy; layer 1 representations yielded the best signal. Evaluated on the Dutch NKI-SpeechRT and NKI-OC-VC datasets, SpeechLMScore outperforms traditional acoustic feature-based approaches, though it generally remains below the reference-based upper bound provided by phoneme error rate (PER), particularly on the NKI-SpeechRT data. The work also examines the effect of noise using the subjective noisiness ratings in NKI-SpeechRT and finds SpeechLMScore robust to it. Limitations and future directions include retraining the language model on more healthy Dutch speech and improving interpretability. Overall, the study offers a practical, reference-free approach to severity estimation, with potential for clinical deployment and scalable triage of speech disorders.
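The idea of using perplexity over discrete units as a severity proxy can be illustrated with a minimal sketch. This is not the authors' implementation: a smoothed bigram model stands in for the LSTM, and small hand-written integer sequences stand in for quantized HuBERT units. The function names `train_bigram_lm` and `unit_perplexity` and all data are hypothetical.

```python
import math
from collections import Counter


def train_bigram_lm(sequences, vocab_size):
    """Count unit bigrams and return an add-one-smoothed log p(cur | prev).

    Assumption: a bigram model replaces the LSTM described in the text.
    """
    bigrams = Counter()
    contexts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1

    def log_prob(prev, cur):
        # Add-one smoothing keeps unseen transitions at non-zero probability.
        return math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))

    return log_prob


def unit_perplexity(units, log_prob):
    """Perplexity = exp(-mean log p(u_t | u_{t-1})) over the transitions."""
    if len(units) < 2:
        raise ValueError("need at least two units to score a sequence")
    total = sum(log_prob(prev, cur) for prev, cur in zip(units, units[1:]))
    return math.exp(-total / (len(units) - 1))


# Hypothetical toy "typical speech" unit sequences used to fit the model.
training_units = [[0, 1, 2, 3] * 5, [0, 1, 2, 3] * 3]
lm = train_bigram_lm(training_units, vocab_size=4)

# A sequence that follows the training pattern, and one that does not.
typical = [0, 1, 2, 3, 0, 1, 2, 3]
atypical = [3, 1, 0, 2, 2, 0, 3, 1]

ppl_typical = unit_perplexity(typical, lm)
ppl_atypical = unit_perplexity(atypical, lm)
print(f"typical: {ppl_typical:.2f}  atypical: {ppl_atypical:.2f}")
```

Because the model is fitted only on typical sequences, unit patterns that deviate from them receive lower probability and therefore higher perplexity. No pathological data is needed at training time, which matches the motivation stated in the abstract.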

Abstract

Speech severity evaluation is becoming increasingly important as the economic burden of speech disorders grows. Current speech severity models often struggle with generalization, learning dataset-specific acoustic cues rather than meaningful correlates of speech severity. Furthermore, many models require reference speech or a transcript, limiting their applicability in ecologically valid scenarios, such as spontaneous speech evaluation. Previous research indicated that automatic speech naturalness evaluation scores correlate strongly with severity evaluation scores, leading us to explore a reference-free method, SpeechLMScore, which does not rely on pathological speech data. Additionally, we present the NKI-SpeechRT dataset, based on the NKI-CCRT dataset, to provide a more comprehensive foundation for speech severity evaluation. This study evaluates whether SpeechLMScore outperforms traditional acoustic feature-based approaches and assesses the performance gap between reference-free and reference-based models. Moreover, we examine the impact of noise on these models by utilizing subjective noise ratings in the NKI-SpeechRT dataset. The results demonstrate that SpeechLMScore is robust to noise and offers superior performance compared to traditional approaches.

Paper Structure

This paper contains 15 sections, 3 equations, 2 tables.