Unveiling Interpretability in Self-Supervised Speech Representations for Parkinson's Diagnosis

David Gimeno-Gómez, Catarina Botelho, Anna Pompili, Alberto Abad, Carlos-D. Martínez-Hinarejos

TL;DR

This work targets the interpretability gap in self-supervised speech representations for Parkinson's disease diagnosis. It introduces an interpretable cross-attention framework that fuses SSL embeddings from Wav2Vec2.0 with a compact set of 35 clinically informed features, producing embedding-level ($S_{\mathrm{emb}}$) and temporal-level ($S_{\mathrm{temp}}$) explanations via cross-attention defined by $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$. Evaluated on five multilingual PD speech benchmarks across six assessment tasks, the method achieves competitive accuracy while providing interpretable insights that align with clinical dimensions such as phonation, articulation, and prosody, and demonstrates cross-lingual robustness in spontaneous speech. The study discusses the trade-off between accuracy and transparency, presents detailed embedding- and temporal-level analyses, and outlines limitations and future directions, including broader pathologies and clinician validation to enhance trust in computer-assisted diagnosis systems.
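
The scaled dot-product cross-attention in the formula above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: treating SSL frames as queries and the 35 informed features as keys and values, the dimensions `T` and `d_k`, and the names `cross_attention` and `relevance` are all assumptions made here, and random arrays stand in for real embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Returns both the attended output and the attention weights, since the
    weights are what an attention-based explanation would inspect.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Hypothetical shapes: T SSL frames attend over F informed features.
rng = np.random.default_rng(0)
T, F, d_k = 50, 35, 64
Q = rng.normal(size=(T, d_k))  # stand-in for projected SSL frame embeddings
K = rng.normal(size=(F, d_k))  # stand-in for projected informed features
V = rng.normal(size=(F, d_k))

out, weights = cross_attention(Q, K, V)

# Averaging the weights over frames gives one score per informed feature
# (an assumed aggregation, used here only for illustration).
relevance = weights.mean(axis=0)
```

Each row of `weights` is a probability distribution over the 35 features, so the frame-averaged `relevance` vector also sums to 1 and can be read as a per-feature share of attention under these assumptions.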

Abstract

Recent works in pathological speech analysis have increasingly relied on powerful self-supervised speech representations, leading to promising results. However, the complex, black-box nature of these embeddings and the limited research on their interpretability significantly restrict their adoption for clinical diagnosis. To address this gap, we propose a novel, interpretable framework specifically designed to support Parkinson's Disease (PD) diagnosis. Through the design of simple yet effective cross-attention mechanisms for both embedding- and temporal-level analysis, the proposed framework offers interpretability from two distinct but complementary perspectives. Experimental findings across five well-established speech benchmarks for PD detection demonstrate the framework's capability to identify meaningful speech patterns within self-supervised representations for a wide range of assessment tasks. Fine-grained temporal analyses further underscore its potential to enhance the interpretability of deep-learning pathological speech models, paving the way for the development of more transparent, trustworthy, and clinically applicable computer-assisted diagnosis systems in this domain. Moreover, in terms of classification accuracy, our method achieves results competitive with state-of-the-art approaches, while also demonstrating robustness in cross-lingual scenarios when applied to spontaneous speech production.

Paper Structure

This paper contains 19 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The overall architecture of our proposed framework for PD diagnosis support, as well as the motivations behind each interpretable module design.
  • Figure 2: Attention-based relevance scores from the embedding interpretability perspective. For each assessment task considered in our study, the figure presents the averaged scores of the 35 selected informed speech features across the Healthy Control (HC) and Parkinson's Disease (PD) groups of the GITA test set.
  • Figure 3: Embedding-level cross-attention alignment showing the difference between the averaged attention scores for Healthy Control (HC) and Parkinson's Disease (PD) groups in the DDK task of the GITA test set.
  • Figure 4: Temporal contrastive analysis of GITA subject no. 15, diagnosed with Parkinson's Disease, during the SENTENCES task. The analysis focuses on the phonetically balanced phrase "Mi casa tiene tres cuartos" (Spanish for "My house has three rooms").