Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations

Mukhtar Mohamed; Oli Danyi Liu; Hao Tang; Sharon Goldwater

TL;DR

This paper asks which geometric properties of self-supervised speech representations drive downstream performance. The authors introduce Cumulative Residual Variance (CRV) to quantify orthogonality between the phonetic and speaker subspaces, and Self-CRV to quantify isotropy, and evaluate six SSL models on LibriSpeech with linear probes. They find a strong link between phone probing accuracy and both the isotropy and the orthogonality of the phonetic subspace, whereas isotropy of the frame representations is not consistently predictive; centroid-level phonetic isotropy is especially informative. This geometry-driven analysis offers a principled direction for diagnosing and improving self-supervised speech representations for phonetic decoding and speaker discrimination.

Abstract

Self-supervised speech representations can hugely benefit downstream speech technologies, yet the properties that make them useful are still poorly understood. Two candidate properties related to the geometry of the representation space have been hypothesized to correlate well with downstream tasks: (1) the degree of orthogonality between the subspaces spanned by the speaker centroids and phone centroids, and (2) the isotropy of the space, i.e., the degree to which all dimensions are effectively utilized. To study them, we introduce a new measure, Cumulative Residual Variance (CRV), which can be used to assess both properties. Using linear classifiers for speaker and phone ID to probe the representations of six different self-supervised models and two untrained baselines, we ask whether either orthogonality or isotropy correlates with linear probing accuracy. We find that both measures correlate with phonetic probing accuracy, though our results on isotropy are more nuanced.
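To make the residual-variance idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's definition of CRV: it takes two centroid matrices (for example phone centroids `A` and speaker centroids `B`, one row per class), removes the top-k principal directions of `B` from `A`, and records the fraction of `A`'s variance that remains for each k. The function names, the relative tolerance, and the choice to return the whole curve are all hypothetical; how the paper aggregates such a curve into a single CRV value is not stated in this section.

```python
import numpy as np


def principal_directions(X, tol=1e-10):
    """Unit principal directions of X (rows), keeping only non-degenerate ones.

    Assumption: directions whose singular value is below tol * largest are
    dropped, so that only the subspace actually spanned by X is used.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    keep = s > tol * s[0]
    return vt[keep]


def residual_variance_curve(A, B):
    """Fraction of A's variance left after removing B's top-k directions.

    Entry k of the result corresponds to projecting out the first k principal
    directions of B (k = 0 means nothing has been removed).
    Assumes A has non-zero variance around its mean.
    """
    Ac = A - A.mean(axis=0, keepdims=True)
    dirs = principal_directions(B)
    total = np.sum(Ac ** 2)
    curve = [1.0]
    for k in range(1, len(dirs) + 1):
        P = dirs[:k]                       # (k, d), orthonormal rows
        resid = Ac - Ac @ P.T @ P          # remove the span of the top-k directions
        curve.append(np.sum(resid ** 2) / total)
    return np.array(curve)
```

Under these assumptions, a curve that stays near 1 when `A` holds phone centroids and `B` holds speaker centroids would indicate nearly orthogonal subspaces, since removing speaker directions leaves the phone variance intact. Calling `residual_variance_curve(A, A)` gives a self-residual curve: the more evenly it decays, the more evenly `A`'s variance is spread across its own directions, which is one way to read isotropy.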

Paper Structure (13 sections, 3 figures)

Introduction
Isotropy and orthogonality
Measuring orthogonality
Cosine similarity between principal directions
Cumulative Residual Variance (CRV)
Evaluating isotropy with Self-CRV
Experimental Setup
Results and Discussion
Layerwise Classification Accuracy
Geometry of the phone and speaker subspaces
Isotropy of the frame representation space
Conclusion
Acknowledgements

Figures (3)

Figure 1: Evaluating orthogonality as cosine similarities between principal components (a-c) versus using residual variance (d). Isotropy can be evaluated with self residual variance (e).
Figure 2: Layerwise results for all models, showing (in columns from left to right): phone and speaker classification accuracy; CRV orthogonality measures (Ph\Spk and Spk\Ph); and Self-CRV measures (Ph\Ph and Spk\Spk).
Figure 3: Correlations between orthogonality or isotropy measures and probing accuracies. Marker styles are as in Fig. 2.
