Multiview Canonical Correlation Analysis for Automatic Pathological Speech Detection

Yacouba Kaloga, Shakeel A. Sheikh, Ina Kodrasi

TL;DR

The paper tackles automatic pathological speech detection by using Multiview Canonical Correlation Analysis (MCCA) across time chunks to remove pathology-irrelevant cues from the input representations. Spectrograms and wav2vec2 embeddings are projected into a shared low-dimensional space (S*) and passed to simple classifiers such as an MLP or LightGBM (LGBM), which preserves pathology-discriminant cues while suppressing uncorrelated temporal variations. Experiments on Spanish speech from speakers with Parkinson's disease (PD) and neurotypical speakers show that MCCA consistently outperforms PCA, with notable gains for spectrogram inputs and competitive results for SSL-based features, while maintaining interpretability. The work highlights practical benefits for data-efficient, interpretable clinical tools and suggests non-linear MCCA methods and robustness studies as future directions.

Abstract

Recently proposed automatic pathological speech detection approaches rely on spectrogram input representations or wav2vec2 embeddings. These representations may contain pathology-irrelevant, uncorrelated information, such as changing phonetic content or variations in speaking style across time, which can adversely affect classification performance. To address this issue, we propose to use Multiview Canonical Correlation Analysis (MCCA) on these input representations prior to automatic pathological speech detection. Our results demonstrate that, unlike other dimensionality reduction techniques, MCCA leads to a considerable improvement in pathological speech detection performance by eliminating uncorrelated information present in the input representations. Employing MCCA with traditional classifiers yields comparable or higher performance than using sophisticated architectures, while preserving the representation structure and providing interpretability.

Paper Structure

This paper contains 9 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: (a) Spectrogram of a $500$ ms long speech segment with $257$ frequency bins representing the frequency range from $0$ Hz to $8000$ Hz. The vertical lines denote the boundary of $M = 8$ chunks, with each chunk considered to be a single view of the speech segment. (b) Representation after applying MCCA, corresponding to the same $257$ bins and containing $T = 5$ components.
  • Figure 2: MLP and LGBM performance for different numbers of chunks $M$ with $5$ and $10$ MCCA components using (a) spectrogram and (b) w2v2 embeddings.
  • Figure 3: LGBM performance on the validation and test sets using different percentages of top-ranked features for (a) spectrogram and (b) w2v2 embeddings. For ease of comparison, the performance on the test set using all features is also illustrated.
  • Figure 4: MCCA components of the frequency bins belonging to the $1.5$% top-ranked features. The color map illustrates the importance assigned to each bin when using LGBM.
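
The setup in the Figure 1 caption (a segment with $257$ frequency bins split into $M = 8$ time chunks, each treated as one view, then reduced to $T = 5$ components) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a MAXVAR-style MCCA (whiten each view, then take the leading singular vectors of the concatenated whitened views), which may differ from the formulation used in the paper. The function names, the synthetic data, and the choice of 48 frames per segment are hypothetical.

```python
import numpy as np


def split_into_views(segment, n_views):
    """Split a (n_bins, n_frames) segment into n_views consecutive time chunks."""
    return np.array_split(segment, n_views, axis=1)


def mcca_shared_representation(views, n_components, tol=1e-10):
    """MAXVAR-style MCCA sketch (an assumed variant, not the paper's exact method).

    Each view has shape (n_bins, frames_in_chunk); the frequency bins act as
    samples. Returns an (n_bins, n_components) shared representation.
    """
    bases = []
    for view in views:
        # Center each frame (column) over the frequency bins.
        centered = view - view.mean(axis=0, keepdims=True)
        # Whitening: keep an orthonormal basis of the view's column space.
        u, s, _ = np.linalg.svd(centered, full_matrices=False)
        bases.append(u[:, s > tol * s.max()])
    # Directions shared by many views dominate the concatenated bases.
    stacked = np.concatenate(bases, axis=1)
    u, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, :n_components]


# Synthetic stand-in for a spectrogram segment: one pattern over frequency
# that persists across time, plus noise that is uncorrelated across chunks.
rng = np.random.default_rng(0)
n_bins, n_frames, n_views, n_components = 257, 48, 8, 5  # n_frames is assumed
shared_pattern = rng.standard_normal((n_bins, 1))
segment = (shared_pattern @ rng.standard_normal((1, n_frames))
           + 0.1 * rng.standard_normal((n_bins, n_frames)))

views = split_into_views(segment, n_views)
representation = mcca_shared_representation(views, n_components)
print(representation.shape)  # (257, 5)
```

The output keeps one row per frequency bin, mirroring the "same $257$ bins" structure described in the caption, so a downstream classifier or feature ranking can still be read in terms of frequency bins. Under this sketch, the first component tracks the pattern that recurs in every chunk, while chunk-specific noise is pushed into later components.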