Geometric Analysis of Speech Representation Spaces: Topological Disentanglement and Confound Detection

Bipasha Kashyap; Pubudu N. Pathirana

TL;DR

A four-metric clustering framework evaluates the geometric disentanglement of emotional, linguistic, and pathological speech features across six corpora and eight dataset combinations, yielding actionable guidelines for equitable and reliable speech health systems across diverse populations.

Abstract

Speech-based clinical tools are increasingly deployed in multilingual settings, yet it is unclear whether pathological speech markers remain geometrically separable from accent variation. Without that separation, systems may misclassify healthy non-native speakers or miss pathology in multilingual patients. We propose a four-metric clustering framework to evaluate geometric disentanglement of emotional, linguistic, and pathological speech features across six corpora and eight dataset combinations. A consistent hierarchy emerges: emotional features form the tightest clusters (Silhouette 0.250), followed by pathological (0.141) and linguistic (0.077). Confound analysis shows that pathological-linguistic overlap stays below 0.21 in every combination: above the permutation null, but bounded for clinical deployment. Trustworthiness analysis confirms embedding fidelity and the robustness of the geometric conclusions. Our framework provides actionable guidelines for equitable and reliable speech health systems across diverse populations.

Paper Structure (19 sections, 7 equations, 3 figures, 2 tables)

Introduction
Related Work
Speech Representation Disentanglement
Clustering Quality Assessment
Clinical Speech Assessment in Multilingual Settings
Methodology
Feature Extraction
Manifold Learning via t-SNE
Clustering Quality Metrics
Confound Detection
Experimental Setup
Datasets
Implementation
Results and Discussion
Per-Dimension Clustering Quality
...and 4 more sections

Figures (3)

Figure 1: Topological analysis framework. Overview: multi-dimensional features (emotional $\mathbb{R}^{28}$, linguistic $\mathbb{R}^{33}$, pathological $\mathbb{R}^{16}$) undergo t-SNE embedding, followed by three branches: (A) clustering quality via four metrics, (B) bootstrap stability ($B = 20$, 80% subsampling), and (C) confound detection via $2\sigma$ overlap with a permutation null in shared PCA space.
Figure 2: t-SNE embeddings ($\mathbb{R}^2$, perplexity = 30) across all eight combinations. Per-dimension Gaussian kernel density contours (Silverman, 1986; bandwidth = 0.4, isoline at 30% of maximum density) highlight manifold extent; filled regions show core density. Silhouette score badges (lower right) quantify per-panel clustering quality. The emotional-dominant hierarchy is consistent across all combinations, with pathological features forming intermediate clusters and linguistic features showing the most diffuse structure.
Figure 3: Pathological–linguistic overlap (Eq. 6) across eight combinations, versus the permutation null. The shaded region marks the 90% confidence interval of the permutation null (Good, 2005), with its mean shown dotted ($\mu_{\text{null}} \approx 0.06$). All observed values exceed the null baseline, confirming genuine shared structure, yet remain bounded ($< 0.21$). The solid line denotes the observed mean ($\mu_{\text{obs}} = 0.169$). L2A pairings show higher overlap than GMU pairings, suggesting that accent diversity improves separation.
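
The bootstrap stability branch in Figure 1 ($B = 20$, 80% subsampling) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a 2-D embedding has already been computed (for example by t-SNE), that subsampling is done without replacement, and that the Silhouette score uses Euclidean distances. The function names, the synthetic blobs, and the use of feature-dimension tags as cluster labels are hypothetical stand-ins, and the other three metrics of the framework are not named in this summary, so only the Silhouette score is shown.

```python
import numpy as np


def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distances."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    if classes.size < 2:
        raise ValueError("silhouette needs at least two clusters")
    # Pairwise Euclidean distance matrix, shape (n, n).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: score is 0 by convention
        # a: mean distance to the other members of the point's own cluster
        # (the zero self-distance is in the sum, hence n_own - 1).
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to any other cluster.
        b = min(dist[i, labels == c].mean() for c in classes if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()


def bootstrap_silhouette(points, labels, n_boot=20, frac=0.8, seed=0):
    """Mean and standard deviation of the silhouette over subsamples.

    Assumption: each replicate draws `frac` of the points WITHOUT
    replacement; the source only states "80% subsampling".
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    n = len(points)
    m = int(round(frac * n))
    scores = []
    for _ in range(n_boot):
        idx = rng.choice(n, size=m, replace=False)
        scores.append(silhouette(points[idx], labels[idx]))
    scores = np.array(scores)
    return scores.mean(), scores.std(ddof=1)


# Synthetic stand-in for a 2-D embedding: three Gaussian blobs,
# tagged with hypothetical feature-dimension labels.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
embedding = np.vstack([c + rng.normal(size=(60, 2)) for c in centers])
dims = np.repeat(["emotional", "linguistic", "pathological"], 60)

full_score = silhouette(embedding, dims)
mean_score, std_score = bootstrap_silhouette(embedding, dims)
print(f"full: {full_score:.3f}  bootstrap: {mean_score:.3f} +/- {std_score:.3f}")
```

A small spread across the 20 replicates relative to the gaps between the per-dimension scores would suggest that an ordering such as the one reported in the abstract is not an artifact of a particular sample.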
