Probing Whisper for Dysarthric Speech in Detection and Assessment

Zhengjun Yue, Devendra Kayande, Zoran Cvetkovic, Erfan Loweimi

TL;DR

This work investigates how the Whisper-Medium (Whisper-M) encoder represents dysarthric speech for detection and severity classification, using layer-wise probing across all encoder layers. Linear classifiers are trained on each layer's embeddings under single-task and multi-task setups, and the representations are evaluated with accuracy, F1, mutual information, and Silhouette scores for both the pretrained and the fine-tuned model. The key finding is that the mid-level encoder layers (13–15) are the most informative across metrics, while fine-tuning induces only modest changes, especially in higher layers, and multi-task learning offers little benefit. The results advance the interpretability of large pretrained models in clinical contexts and show how probing analyses can guide the choice of layer representations for pathological speech tasks.
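The layer-wise probing described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the embeddings are synthetic stand-ins generated by a hypothetical `make_synthetic_layers` helper (with class separation deliberately made to peak mid-stack), the probe is a plain logistic regression trained by gradient descent in NumPy, and the train/test split is a simple random split over utterances. All function names, the layer count, and the hyperparameters are assumptions made for the example.

```python
import numpy as np

def make_synthetic_layers(n_layers=6, n_per_class=100, dim=16, seed=0):
    """Hypothetical stand-in for pooled encoder embeddings, one array per layer.
    Class separation peaks at the middle layer purely for illustration."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n_per_class)  # 0 = control, 1 = dysarthric
    layers = []
    for layer in range(n_layers):
        # Illustrative assumption, not measured data: largest separation mid-stack.
        sep = 2.0 * np.exp(-((layer - n_layers / 2) ** 2) / 4.0)
        x = rng.normal(size=(2 * n_per_class, dim))
        x[:, 0] += sep * (2 * labels - 1)
        layers.append(x)
    return layers, labels

def train_linear_probe(x, y, lr=0.1, epochs=200):
    """Binary logistic regression fitted by full-batch gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def accuracy_and_f1(w, b, x, y):
    """Accuracy and F1 (positive class = 1) of the linear probe on held-out data."""
    pred = (x @ w + b > 0).astype(int)
    acc = (pred == y).mean()
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return acc, f1

def probe_all_layers(layers, labels, seed=0):
    """Train one probe per layer on a shared 80/20 split and score each on the test part."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    split = int(0.8 * len(idx))
    train, test = idx[:split], idx[split:]  # a real study would likely split by speaker
    results = []
    for x in layers:
        mu, sd = x[train].mean(axis=0), x[train].std(axis=0) + 1e-8
        x_train, x_test = (x[train] - mu) / sd, (x[test] - mu) / sd
        w, b = train_linear_probe(x_train, labels[train])
        results.append(accuracy_and_f1(w, b, x_test, labels[test]))
    return results

layers, labels = make_synthetic_layers()
results = probe_all_layers(layers, labels)
for layer, (acc, f1) in enumerate(results):
    print(f"layer {layer}: accuracy={acc:.3f}  F1={f1:.3f}")
```

Keeping the probe linear is the point of the design: a low-capacity classifier can only exploit information that is already linearly accessible in the embedding, so differences in probe accuracy across layers reflect the representations rather than the classifier.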

Abstract

Large-scale end-to-end models such as Whisper have shown strong performance on diverse speech tasks, but their internal behavior on pathological speech remains poorly understood. Understanding how dysarthric speech is represented across layers is critical for building reliable and explainable clinical assessment tools. This study probes the encoder of the Whisper-Medium model on two dysarthric speech tasks: detection and assessment (i.e., severity classification). We evaluate layer-wise embeddings with a linear classifier under both single-task and multi-task settings, and complement these results with Silhouette scores and mutual information to provide additional perspectives on layer informativeness. To examine adaptability, we repeat the analysis after fine-tuning Whisper on a dysarthric speech recognition task. Across metrics, the mid-level encoder layers (13–15) emerge as the most informative, while fine-tuning induces only modest changes. The findings improve the interpretability of Whisper's embeddings and highlight the potential of probing analyses to guide the use of large-scale pretrained models for pathological speech.
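As a rough illustration of the two classifier-free measures mentioned above, the sketch below computes a Silhouette score and a simple mutual information (MI) estimate for a set of labeled embeddings. The abstract does not say which MI estimator the authors use, so the one here is a swapped-in assumption: embeddings are projected onto the difference of class means, quantile-binned, and scored with a plug-in discrete MI. Because the projection direction is chosen using the labels, this estimate is optimistically biased and is only meant to show the idea. All names and the synthetic data are hypothetical.

```python
import numpy as np

def silhouette_score(x, labels):
    """Mean Silhouette coefficient with Euclidean distance, using labels as clusters."""
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    classes = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        n_same = same.sum() - 1
        if n_same == 0:
            continue  # singleton cluster: score 0 by convention
        a = dist[i, same].sum() / n_same  # self-distance is 0, so it drops out
        b = min(dist[i, labels == c].mean() for c in classes if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

def discrete_mutual_information(a, b):
    """Plug-in MI (in nats) between two discrete variables."""
    a_vals, a_idx = np.unique(a, return_inverse=True)
    b_vals, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((len(a_vals), len(b_vals)))
    np.add.at(joint, (a_idx, b_idx), 1)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])))

def projected_mi(x, labels, n_bins=8):
    """Assumed estimator: MI between labels and a quantile-binned 1-D projection."""
    direction = x[labels == 1].mean(axis=0) - x[labels == 0].mean(axis=0)
    proj = x @ direction
    edges = np.quantile(proj, np.linspace(0, 1, n_bins + 1)[1:-1])
    return discrete_mutual_information(np.digitize(proj, edges), labels)

# Synthetic stand-ins for two layers: one with clear class structure, one without.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 60)
separated = rng.normal(size=(120, 8))
separated[:, 0] += 3.0 * (2 * labels - 1)
overlapping = rng.normal(size=(120, 8))

sil_separated = silhouette_score(separated, labels)
sil_overlapping = silhouette_score(overlapping, labels)
mi_separated = projected_mi(separated, labels)
mi_overlapping = projected_mi(overlapping, labels)
print(f"Silhouette: separated={sil_separated:.3f}, overlapping={sil_overlapping:.3f}")
print(f"MI (nats):  separated={mi_separated:.3f}, overlapping={mi_overlapping:.3f}")
```

Both measures need no trained classifier, which is why they are useful as a cross-check on probe accuracy: the Silhouette score reflects how tightly the classes cluster in the embedding space, and MI reflects how much label information the embedding carries.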

Paper Structure

This paper contains 16 sections, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Workflow of layer probing on Whisper-M encoder layers' embeddings using the single and multi-task approaches.
  • Figure 2: Detection results for probing pretrained Whisper embeddings (single-task vs. multi-task).
  • Figure 3: Detection accuracy with error bars for probing pretrained and fine-tuned Whisper embeddings.
  • Figure 4: Layer-wise mutual information (MI) between (a) pretrained (PT) Whisper embeddings and labels, and (b) PT and fine-tuned (FT) Whisper embeddings.
  • Figure 5: Silhouette score between layer-wise PT Whisper embeddings and labels for the detection task.