SAND Challenge: Four Approaches for Dysarthria Severity Classification

Gauri Deshpande, Harish Battula, Ashish Panda, Sunil Kumar Kopparapu

TL;DR

The paper compares four approaches to 5-class dysarthria severity classification in SAND Task #1, all evaluated on the same dataset and utterance set. A feature-based hierarchical XGBoost pipeline, which uses glottal and formant features in a two-stage cascade, delivers the strongest macro-F1 (~0.86). The deep learning variants (ViT-Ave, 1D-CNN, BiLSTM-OF) reach lower macro-F1 scores (~0.68–0.70) but provide complementary insights into speech impairment. The study highlights the benefits of domain knowledge and tailored fusion strategies in low-data regimes, and suggests that hybrid models fusing engineered features with neural representations are a promising direction. Overall, the results indicate that combining diverse strategies (end-to-end learning and expert-feature methods) supports robust dysarthria classification under challenging data conditions, with hybrids and larger datasets as clear avenues for improvement.

Abstract

This paper presents a unified study of four distinct modeling approaches for classifying dysarthria severity in the Speech Analysis for Neurodegenerative Diseases (SAND) challenge. All models tackle the same five-class classification task using a common dataset of speech recordings. We investigate: (1) a ViT-OF method leveraging a Vision Transformer on spectrogram images, (2) a 1D-CNN approach using eight 1-D CNNs with majority-vote fusion, (3) a BiLSTM-OF approach using nine BiLSTM models with majority-vote fusion, and (4) a Hierarchical XGBoost ensemble that combines glottal and formant features through a two-stage learning framework. Each method is described, and their performances on a validation set of 53 speakers are compared. Results show that while the feature-engineered XGBoost ensemble achieves the highest macro-F1 (0.86), the deep learning models (ViT, CNN, BiLSTM) attain competitive F1-scores (0.70) and offer complementary insights into the problem.
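
As a rough illustration of two ideas named in the abstract, majority-vote fusion and macro-F1, the following minimal sketch may help. It is not the authors' implementation: the function names `majority_vote` and `macro_f1`, the tie-breaking rule (smallest label wins), and the example labels are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse the class labels predicted by several models for one speaker.

    Assumption: ties are broken in favour of the smallest label; the
    paper's actual tie-breaking rule is not stated in the abstract.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores over the given labels."""
    scores = []
    for k in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted samples scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical votes from eight models for a single speaker.
fused = majority_vote([3, 3, 2, 5, 3, 2, 2, 3])
```

Because macro-F1 averages the per-class scores without weighting by class size, a rare severity class counts as much as a common one, which is why it is a natural metric for an imbalanced five-class task.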

Paper Structure

This paper contains 20 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Spectrogram of ID300 for the rhythm KA.
  • Figure 2: Block diagram of hierarchical XGBoost approach.
  • Figure 3: Glottal parameters extracted from speech signals for speakers ID077 (Class 1) and ID000 (Class 5).
  • Figure 4: Architecture for late fusion model.
  • Figure 5: Sample train and validation F1-score versus epoch for $C_{+}$ and rhythm TA.

Theorems & Definitions (3)

  • Remark 1
  • Remark 2
  • Remark 3