Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features

Kyle Janse van Rensburg; Benjamin van Niekerk; Herman Kamper

TL;DR

Using WavLM, this paper finds that the first principal dimension of utterance-averaged features, the one that explains the most variance, encodes pitch and associated characteristics like gender. It also shows that most of the analysed characteristics can be controlled in synthesis applications by changing the corresponding dimensions.

Abstract

How do speech models trained through self-supervised learning structure their representations? Previous studies have looked at how information is encoded in feature vectors across different layers. But few studies have considered whether speech characteristics are captured within individual dimensions of SSL features. In this paper we specifically look at speaker information using PCA on utterance-averaged representations. Using WavLM, we find that the principal dimension that explains most variance encodes pitch and associated characteristics like gender. Other individual principal dimensions correlate with intensity, noise levels, the second formant, and higher frequency characteristics. Finally, in synthesis experiments we show that most characteristics can be controlled by changing the corresponding dimensions. This provides a simple method to control characteristics of the output voice in synthesis applications.
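
The analysis described in the abstract (PCA on utterance-averaged representations, followed by a check of how well a principal dimension tracks a speaker characteristic) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data is synthetic rather than real WavLM output, and the helper names `utterance_average`, `fit_pca`, and `r_squared` are hypothetical.

```python
import numpy as np

def utterance_average(frames):
    """Mean-pool a (num_frames, feat_dim) array into one (feat_dim,) vector."""
    return frames.mean(axis=0)

def fit_pca(X):
    """Return (mean, components, explained_variance_ratio) for the rows of X."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    var = s ** 2
    return mean, vt, var / var.sum()

def r_squared(x, y):
    """Coefficient of determination of a least-squares line fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residual = y - (slope * x + intercept)
    return 1.0 - residual.var() / y.var()

# Synthetic stand-in data (NOT real WavLM features): 200 utterances with
# 16-dimensional frames and a made-up per-utterance characteristic that is
# injected along one feature axis.
rng = np.random.default_rng(0)
characteristic = rng.normal(size=200)
direction = np.zeros(16)
direction[0] = 5.0
utterances = [
    rng.normal(size=(rng.integers(20, 50), 16)) + c * direction
    for c in characteristic
]

# One vector per utterance, then PCA across utterances.
X = np.stack([utterance_average(u) for u in utterances])
mean, components, ratio = fit_pca(X)

# Principal-dimension scores; column 0 is "principal dimension 1".
Z = (X - mean) @ components.T
r2_first = r_squared(Z[:, 0], characteristic)
```

Because the synthetic characteristic is injected along a single axis, the first principal dimension recovers it almost perfectly here. With real features the relationship would be weaker and would have to be measured per dimension and per characteristic.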

Paper Structure (10 sections, 1 equation, 4 figures)

Introduction
Methodology
Speaker characteristics
Principal component analysis on SSL features
Correlation analysis
Analysis of Principal Dimensions
Towards Control by Manipulating Dimensions
Experimental setup
Results
Conclusion

Figures (4)

Figure 1: (a): Scatter plot showing the linear relationship between principal dimension 2 and intensity, with $R^2 = 0.40$. (b): Violin plots showing the distributions of principal dimension 1 separately per gender, with $\kappa = 0.96$. Training data is shown.
Figure 2: Heat map showing correlation scores between speaker-specific characteristics and principal dimensions for WavLM-Large layer 6. Development data is shown.
Figure 3: The effect on measured characteristics as particular principal dimensions are varied. The blue line shows the average characteristic over all utterances in the test set as a dimension is varied, with the shaded area indicating one standard deviation for that characteristic. The dotted-green and dashed-orange lines show changes for specific utterances that have, respectively, high and low characteristic values before modification.
Figure 4: An illustration of how varying a principal dimension affects characteristics other than the one it is mainly correlated with. Here, principal dimension 1 (associated with pitch) is varied, but intensity is also measured. The blue line shows the average pitch and the orange line shows the average intensity over all utterances in the test set as the dimension changes. The shaded areas indicate one standard deviation of the characteristic.
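
The manipulation that Figures 3 and 4 evaluate, varying one principal dimension while leaving the others fixed, might look roughly like the sketch below. It assumes a full orthonormal PCA basis so that projecting back is lossless, uses synthetic data, and introduces the hypothetical helpers `fit_full_pca` and `shift_principal_dimension`. How the modified features are turned back into audio is not shown, and the paper's exact procedure may differ.

```python
import numpy as np

def fit_full_pca(X):
    """Return (mean, components); components is square when X has at least
    as many rows as columns and full column rank."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt

def shift_principal_dimension(frames, mean, components, k, delta):
    """Add `delta` to principal dimension k of every frame, then map back."""
    scores = (frames - mean) @ components.T   # project onto the PCA basis
    scores[:, k] += delta                     # vary only the chosen dimension
    return scores @ components + mean         # return to feature space

# Synthetic stand-in data (NOT real WavLM features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))        # pretend utterance-averaged features
mean, components = fit_full_pca(X)

frames = rng.normal(size=(30, 8))    # pretend frame-level features of one utterance
# k=0 is the first principal dimension ("principal dimension 1" in the captions).
shifted = shift_principal_dimension(frames, mean, components, k=0, delta=2.0)
```

With an orthonormal basis this is equivalent to adding `delta * components[k]` to every frame, so the other principal-dimension scores stay unchanged. As Figure 4 illustrates, a measured characteristic such as intensity can still move when a dimension mainly associated with pitch is varied.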
