[b]=[d]-[t]+[p]: Self-supervised Speech Models Discover Phonological Vector Arithmetic

Kwanghee Choi, Eunjung Yeo, Cheol Jun Cho, David Harwath, David R. Mortensen

TL;DR

A study of self-supervised speech model (S3M) representations across 96 languages, with particular attention to phonological vectors. The findings indicate that S3Ms encode speech using phonologically interpretable and compositional vectors, and that these vectors support phonological vector arithmetic.

Abstract

Self-supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored. We conduct a comprehensive study across 96 languages to analyze the underlying structure of S3M representations, with particular attention to phonological vectors. We first show that there exist linear directions within the model's representation space that correspond to phonological features. We further demonstrate that the scale of these phonological vectors correlates with the degree of acoustic realization of their corresponding phonological features in a continuous manner. For example, the difference between [d] and [t] yields a voicing vector: adding this vector to [p] produces [b], while scaling it results in a continuum of voicing. Together, these findings indicate that S3Ms encode speech using phonologically interpretable and compositional vectors, demonstrating phonological vector arithmetic. All code and interactive demos are available at https://github.com/juice500ml/phonetic-arithmetic .
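The arithmetic described in the abstract can be illustrated with a minimal sketch. It does not use real S3M features: the vectors below are synthetic, built by hand from a hypothetical "place" component plus a shared "voicing" component, so the analogy holds by construction. The names `phones`, `nearest_phone`, and `scaled` are illustrative assumptions, not part of the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Synthetic stand-ins for phone representations (assumption: NOT S3M outputs).
# Each phone is a place-of-articulation component, plus voicing if voiced.
voicing = rng.normal(size=dim)
alveolar = rng.normal(size=dim)
bilabial = rng.normal(size=dim)

phones = {
    "t": alveolar,
    "d": alveolar + voicing,
    "p": bilabial,
    "b": bilabial + voicing,
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def nearest_phone(query, inventory):
    """Return the phone whose vector is most similar to the query."""
    return max(inventory, key=lambda name: cosine(query, inventory[name]))


# Voicing vector: [d] - [t]
voicing_vector = phones["d"] - phones["t"]

# Analogy: [p] + ([d] - [t]) should land nearest to [b]
predicted = nearest_phone(phones["p"] + voicing_vector, phones)


def scaled(lam):
    """Move [p] along the voicing direction by a scale factor lam."""
    return phones["p"] + lam * voicing_vector


# Projection onto the voicing direction grows with lam, giving a continuum.
lams = np.linspace(0.0, 1.0, 5)
projections = [float(scaled(lam) @ voicing_vector) for lam in lams]
print(predicted, projections)
```

With real models, each phone vector would have to be estimated from model representations of many tokens of that phone; how that pooling is done is outside the scope of this sketch.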

Paper Structure (65 sections, 16 equations, 19 figures, 1 table)

Figures (19)

  • Figure 1: Comparing analogies for text and speech. Word representations capture semantic relations, while speech representations capture phonological relations. Such analogies (\ref{sec:direction}) can be used to control speech synthesis in a phonologically grounded manner (\ref{sec:degree}).
  • Figure 2: Comparing S3Ms with spectral representations on TIMIT (top) and VoxAngeles (bottom).
  • Figure 3: Comparing consonant-only and vowel-only quadruplets on TIMIT (top) and VoxAngeles (bottom) for WavLM. The number in parentheses denotes the number of quadruplets. We exclude cases where a quadruplet contains both consonants and vowels.
  • Figure 4: Comparison between the phonological vector scale $\lambda$ and the acoustic measurements (\ref{ss:measure}) on TIMIT. $\rho$ denotes Spearman's rank correlation coefficient. Blue and orange plots indicate vowels and consonants, respectively. The empirically observed correlation signs match the theoretical expectations shown in \ref{tb:feat-summary}. Further, the plots demonstrate the linearity of phonological vectors, which results in monotonic (but not necessarily linear) changes in acoustic measurements.
  • Figure 5: Applying the round vector to the front vowel [i]; English has no front rounded vowel. Orange and blue arrows denote F2 and F3, respectively, both of which decrease for $\lambda > 0$.
  • ...and 14 more figures
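
The Figure 4 caption reports Spearman's rank correlation and notes that the acoustic changes are monotonic but not necessarily linear. The sketch below shows why a rank correlation suits that case. The data are synthetic: the `tanh` response curve is an arbitrary assumption chosen only because it is monotonic and non-linear, and the helper names are hypothetical.

```python
import numpy as np


def ranks(x):
    """Ranks 0..n-1 of the values in x (assumes no ties)."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])


def pearson_r(x, y):
    """Pearson correlation on the raw values."""
    return float(np.corrcoef(x, y)[0, 1])


# Synthetic scale factors and a made-up saturating acoustic response.
lam = np.linspace(-2.0, 2.0, 21)
measurement = np.tanh(2.0 * lam)

rho = spearman_rho(lam, measurement)
r = pearson_r(lam, measurement)
print(rho, r)
```

Because the synthetic response is strictly increasing, the ranks of `lam` and `measurement` coincide and Spearman's rho is at its maximum, while Pearson's r is lower because the relationship is not a straight line.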