Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations
Mukhtar Mohamed, Oli Danyi Liu, Hao Tang, Sharon Goldwater
TL;DR
This work asks which geometric properties of self-supervised speech representations drive downstream performance. The authors introduce Cumulative Residual Variance (CRV) to quantify orthogonality between the phonetic and speaker subspaces, and Self-CRV to quantify isotropy, and they evaluate six SSL models on LibriSpeech with linear probes. Phone probing accuracy is strongly linked to both the isotropy of the phonetic subspace and its orthogonality to the speaker subspace, whereas isotropy of the frame-level representations is not consistently predictive; isotropy measured on phone centroids is especially informative. This geometry-driven analysis offers a principled direction for diagnosing and improving self-supervised speech representations for phonetic decoding and speaker discrimination.
Abstract
Self-supervised speech representations can hugely benefit downstream speech technologies, yet the properties that make them useful are still poorly understood. Two candidate properties related to the geometry of the representation space have been hypothesized to correlate well with performance on downstream tasks: (1) the degree of orthogonality between the subspaces spanned by the speaker centroids and phone centroids, and (2) the isotropy of the space, i.e., the degree to which all dimensions are effectively utilized. To study them, we introduce a new measure, Cumulative Residual Variance (CRV), which can be used to assess both properties. Using linear classifiers for speaker and phone ID to probe the representations of six different self-supervised models and two untrained baselines, we ask whether either orthogonality or isotropy correlates with linear probing accuracy. We find that both measures correlate with phonetic probing accuracy, though our results on isotropy are more nuanced.
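To make the two geometric properties concrete, the following is a minimal NumPy sketch that computes phone and speaker centroids from frame representations and then scores (a) how much the two centroid subspaces overlap and (b) how evenly variance is spread within one subspace. It is not the CRV or Self-CRV measure, which this abstract does not define. The overlap score (mean squared cosine of the principal angles) and the isotropy score (a normalized participation ratio) are stand-in proxies chosen for illustration. All function names, the `var_kept` threshold, and the toy data are assumptions.

```python
import numpy as np

def centroids(frames, labels):
    """Mean frame vector per label (one row per distinct label)."""
    return np.stack([frames[labels == k].mean(axis=0) for k in np.unique(labels)])

def principal_directions(points, var_kept=0.99):
    """Leading principal directions (rows) and their variances.

    Keeps the fewest directions whose cumulative variance reaches
    `var_kept`, so that near-zero noise directions are discarded.
    Assumes the points are not all identical.
    """
    centred = points - points.mean(axis=0)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    var = s ** 2 / len(points)
    cum = np.cumsum(var) / var.sum()
    k = int(np.searchsorted(cum, var_kept)) + 1
    return vt[:k], var[:k]

def subspace_overlap(dirs_a, dirs_b):
    """Mean squared cosine of the principal angles between two subspaces.

    0 means the subspaces are orthogonal; 1 means the smaller one lies
    entirely inside the larger one. (Illustrative proxy, not CRV.)
    """
    cosines = np.linalg.svd(dirs_a @ dirs_b.T, compute_uv=False)
    return float(np.mean(cosines ** 2))

def isotropy_proxy(variances):
    """Normalized participation ratio: 1 when variance is spread evenly
    over the given directions, 1/len when one direction holds it all.
    (Illustrative proxy, not Self-CRV.)
    """
    return float(variances.sum() ** 2 / (len(variances) * (variances ** 2).sum()))

# Toy data (hypothetical): phone identity lives in dims 0-1, speaker
# identity in dims 2-3, and every phone occurs with every speaker.
rng = np.random.default_rng(0)
phone_means = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
speaker_means = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
p_grid, s_grid = np.meshgrid(np.arange(3), np.arange(4), indexing="ij")
phone_ids = np.repeat(p_grid.ravel(), 20)
speaker_ids = np.repeat(s_grid.ravel(), 20)

frames = np.zeros((len(phone_ids), 6))
frames[:, 0:2] = phone_means[phone_ids]
frames[:, 2:4] = speaker_means[speaker_ids]
frames += 0.01 * rng.standard_normal(frames.shape)

phone_dirs, phone_var = principal_directions(centroids(frames, phone_ids))
speaker_dirs, speaker_var = principal_directions(centroids(frames, speaker_ids))

overlap = subspace_overlap(phone_dirs, speaker_dirs)
phone_iso = isotropy_proxy(phone_var)
speaker_iso = isotropy_proxy(speaker_var)
print(f"overlap={overlap:.4f} phone_iso={phone_iso:.3f} speaker_iso={speaker_iso:.3f}")
```

On this toy data the overlap should be close to zero because the two factors were placed in disjoint dimensions by construction. On real representations, the result depends on which low-variance directions are kept, and a variance-weighted measure such as CRV is designed to address that sensitivity.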
