A study on the impact of Self-Supervised Learning on automatic dysarthric speech assessment

Xavier F. Cadet; Ranya Aloufi; Sara Ahmadi-Abhari; Hamed Haddadi

TL;DR

HuBERT is the most versatile feature extractor across dysarthria classification, word recognition, and intelligibility classification, improving accuracy over classical acoustic features by +24.7%, +61%, and +7.2%, respectively.

Abstract

Automating dysarthria assessments offers the opportunity to develop practical, low-cost tools that address the current limitations of manual and subjective assessments. Nonetheless, the small size of most dysarthria datasets makes it challenging to develop automated assessments. Recent research showed that speech representations from models pre-trained on large amounts of unlabelled data can enhance Automatic Speech Recognition (ASR) performance for dysarthric speech. We are the first to evaluate the representations from pre-trained state-of-the-art Self-Supervised models across three downstream tasks on dysarthric speech (disease classification, word recognition, and intelligibility classification) and under three noise scenarios on the UA-Speech dataset. We show that HuBERT is the most versatile feature extractor across dysarthria classification, word recognition, and intelligibility classification, improving accuracy over classical acoustic features by $+24.7\%$, $+61\%$, and $+7.2\%$, respectively.

Paper Structure (8 sections, 2 figures, 1 table)

Introduction
Dysarthria Automatic Assessments
Methodology
Overview
Experimental Setting
Results
Conclusions and Future Work
Acknowledgements

Figures (2)

Figure 1: The proposed tool overview.
Figure 2: Patient-level predicted intelligibility. The top and bottom rows show the predictions using the acoustic features and the HuBERT features, respectively. From left to right, the predictions are reported for each environment: Default, Noise Reduction, and Noise Addition datasets. Each intelligibility class is gradient-color coded, from very low intelligibility in blue on the left to control level in red on the right. For each patient, the section that stands out indicates the majority predicted intelligibility class, along with its label (Very Low: 0, Low: 1, Medium: 2, High: 3, Control: 4). Although HuBERT features yield higher performance than acoustic features, there are still major misclassifications at the recording level for a given speaker.
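
The patient-level summary described in the Figure 2 caption can be illustrated with a minimal sketch that reduces recording-level predictions to one majority class per speaker. This is not the authors' code: the function names, the speaker identifiers, the tie-breaking rule (lowest class index wins), and the agreement score are assumptions made for illustration. Only the label mapping comes from the caption.

```python
from collections import Counter, defaultdict

# Label mapping as given in the Figure 2 caption.
INTELLIGIBILITY_LABELS = {0: "Very Low", 1: "Low", 2: "Medium", 3: "High", 4: "Control"}


def majority_class_per_speaker(predictions):
    """Reduce recording-level predictions to one class per speaker.

    `predictions` is an iterable of (speaker_id, predicted_class) pairs,
    one pair per recording. Ties are broken by taking the lowest class
    index (an assumption; the source does not specify a rule).
    """
    counts = defaultdict(Counter)
    for speaker_id, predicted_class in predictions:
        counts[speaker_id][predicted_class] += 1

    majority = {}
    for speaker_id, counter in counts.items():
        top_count = max(counter.values())
        majority[speaker_id] = min(c for c, n in counter.items() if n == top_count)
    return majority


def majority_agreement(predictions):
    """Fraction of each speaker's recordings that match the majority class."""
    majority = majority_class_per_speaker(predictions)
    totals = Counter()
    hits = Counter()
    for speaker_id, predicted_class in predictions:
        totals[speaker_id] += 1
        if predicted_class == majority[speaker_id]:
            hits[speaker_id] += 1
    return {s: hits[s] / totals[s] for s in totals}


# Hypothetical recording-level predictions (not data from the paper).
example = [
    ("spk_a", 0), ("spk_a", 0), ("spk_a", 2), ("spk_a", 0),
    ("spk_b", 4), ("spk_b", 3), ("spk_b", 4),
]
for speaker, cls in majority_class_per_speaker(example).items():
    print(speaker, cls, INTELLIGIBILITY_LABELS[cls])
```

A low agreement value for a speaker corresponds to the situation the caption points out: the majority class can be correct while many individual recordings are still misclassified.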
