Unveiling Interpretability in Self-Supervised Speech Representations for Parkinson's Diagnosis
David Gimeno-Gómez, Catarina Botelho, Anna Pompili, Alberto Abad, Carlos-D. Martínez-Hinarejos
TL;DR
This work targets the interpretability gap in self-supervised speech representations for Parkinson's disease diagnosis. It introduces an interpretable cross-attention framework that fuses SSL embeddings from Wav2Vec2.0 with a compact set of 35 clinically informed features, producing embedding-level ($S_{emb}$) and temporal-level ($S_{temp}$) explanations via cross-attention defined by $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$. Evaluated on five multilingual PD speech benchmarks across six assessment tasks, the method achieves competitive accuracy while providing interpretable insights that align with clinical dimensions like phonation, articulation, and prosody, and demonstrates cross-lingual robustness in spontaneous speech. The study discusses the trade-off between accuracy and transparency, presents detailed embedding- and temporal-level analyses, and outlines limitations and future directions, including broader pathologies and clinician validation to enhance trust in computer-assisted diagnosis systems.
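As a rough illustration of the scaled dot-product formula above, the following sketch computes cross-attention in plain Python. It is not the authors' implementation: the function names, the toy vectors, and the choice of which stream supplies the queries versus the keys and values are all assumptions made for the example, since the summary does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    queries: list of n_q vectors of size d_k
    keys:    list of n_k vectors of size d_k
    values:  list of n_k vectors of size d_v
    Returns (outputs, weights), where weights[i][j] is the attention
    that query i places on key j.
    """
    d_k = len(keys[0])
    scale = math.sqrt(d_k)
    n_out = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Scaled dot product between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        # Attention-weighted sum of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(n_out)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Hypothetical toy inputs: two query vectors attending over three key/value pairs.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

outputs, weights = cross_attention(Q, K, V)
```

Each row of `weights` sums to one and can be read as a distribution over the attended items, which is the kind of quantity an attention-based explanation would inspect. How the paper turns such weights into $S_{emb}$ and $S_{temp}$ is not described in this summary, so the sketch stops at the generic attention step.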
Abstract
Recent works in pathological speech analysis have increasingly relied on powerful self-supervised speech representations, leading to promising results. However, the complex, black-box nature of these embeddings and the limited research on their interpretability significantly restrict their adoption for clinical diagnosis. To address this gap, we propose a novel, interpretable framework specifically designed to support Parkinson's Disease (PD) diagnosis. Through the design of simple yet effective cross-attention mechanisms for both embedding- and temporal-level analysis, the proposed framework offers interpretability from two distinct but complementary perspectives. Experimental findings across five well-established speech benchmarks for PD detection demonstrate the framework's capability to identify meaningful speech patterns within self-supervised representations for a wide range of assessment tasks. Fine-grained temporal analyses further underscore its potential to enhance the interpretability of deep-learning pathological speech models, paving the way for the development of more transparent, trustworthy, and clinically applicable computer-assisted diagnosis systems in this domain. Moreover, in terms of classification accuracy, our method achieves results competitive with state-of-the-art approaches, while also demonstrating robustness in cross-lingual scenarios when applied to spontaneous speech production.
