Self-Supervised Speech Models Encode Phonetic Context via Position-dependent Orthogonal Subspaces

Kwanghee Choi; Eunjung Yeo; Cheol Jun Cho; David R. Mortensen; David Harwath

Abstract

Transformer-based self-supervised speech models (S3Ms) are often described as contextualized, yet what this entails remains unclear. Here, we focus on how a single frame-level S3M representation can encode phones and their surrounding context. Prior work has shown that S3Ms represent phones compositionally; for example, phonological vectors such as voicing, bilabiality, and nasality vectors are superposed in the S3M representation of [m]. We extend this view by proposing that phonological information from a sequence of neighboring phones is also compositionally encoded in a single frame, such that vectors corresponding to the previous, current, and next phones are superposed within a single frame-level representation. We show that this structure has several properties, including orthogonality between relative positions and the emergence of implicit phonetic boundaries. Together, our findings advance our understanding of context-dependent S3M representations.
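
As a rough illustration of the structure described in the abstract, the toy sketch below superposes feature vectors for the previous, current, and next phone in one frame and reads each position back out by projection. It is not the paper's method or data: the dimensionality, the three binary features, the helper names `encode_frame` and `decode`, and the exactly orthogonal subspaces built with a QR decomposition are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 768                 # hypothetical hidden size of a frame-level representation
n_feat = 3                # toy features per phone, e.g. voicing, bilabiality, nasality
positions = (-1, 0, 1)    # previous, current, next phone

# Assumption: split one orthonormal basis into disjoint blocks, one per relative
# position, so the position-dependent subspaces are exactly orthogonal in this toy.
basis, _ = np.linalg.qr(rng.standard_normal((dim, n_feat * len(positions))))
subspace = {p: basis[:, i * n_feat:(i + 1) * n_feat] for i, p in enumerate(positions)}


def encode_frame(features_by_position):
    """Superpose the feature vectors of all positions into a single frame."""
    return sum(subspace[p] @ np.asarray(f, dtype=float)
               for p, f in features_by_position.items())


def decode(frame, position):
    """Recover one position's features by projecting onto its subspace."""
    return subspace[position].T @ frame


# Toy frame: the current phone has all three features set (like [m]),
# while the neighbors carry different feature patterns.
frame = encode_frame({-1: [1, 0, 0], 0: [1, 1, 1], 1: [0, 0, 1]})
current = decode(frame, 0)
previous = decode(frame, -1)
```

Because the subspaces are orthogonal, each projection returns only the features written at that relative position; with merely approximate orthogonality, as reported for real S3M representations, the readout would be correspondingly approximate.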

Paper Structure (18 sections, 3 equations, 10 figures)

Settings
Self-supervised Speech Models (S3Ms)
S3M Analysis via Phonological Analogies
Datasets
Experiments
Frame-level compositionality
Contextual phonological vectors
How much neighboring context does the center frame encode?
Effective window size for phonological analogies
Positional orthogonality
Phonetic segmentation
Discussions
Qualitative example of phonological vectors
Layerwise mask-filling behavior
Connection with Observations from Previous Works
...and 3 more sections

Figures (10)

Figure 2: Phonological analogy success rates comparing mean and center pooling on TIMIT (above) and VoxAngeles (below). Center pooling is on par with, and often outperforms, mean pooling, indicating that phonological compositionality is present at the level of individual frame representations.
Figure 3: Phonological analogy success rates for probing contextual information encoded in a single frame-level representation of $p^{0}$ on TIMIT (upper) and VoxAngeles (lower). Center-pooled S3M representations support phonological analogies for the current phone position $(0)$ and its neighbors $(\pm 1)$.
Figure 4: Phonological analogy success rate for center phone $p^{0}$ with respect to relative position on TIMIT (upper) and VoxAngeles (lower). WavLM exhibits the widest window, with high success rates within the center phone position $(0)$, decreasing for $\pm 1$, and near-zero for $\pm 2$. Spectral representations, unlike S3Ms, show nonzero success rates only near the center.
Figure 5: Cosine similarity between phonological vectors extracted from frame-level S3M representations on TIMIT (upper) and VoxAngeles (lower). The structure mirrors that of choi2026self, with opposing features showing negative similarity and related features showing positive similarity.
Figure 6: Cosine similarity between phonological vectors associated with different relative phone positions ($-2$ to $+2$) from TIMIT (upper) and VoxAngeles (lower). Compared with Figure 5, the relative similarity structure is preserved within positions. Further, vectors from different positions exhibit substantially lower similarity than those from the same position, implying approximate positional orthogonality.
...and 5 more figures
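
The captions above refer to center pooling, mean pooling, and phonological analogies. The sketch below shows one plausible reading of those terms on synthetic vectors. It is a minimal illustration under stated assumptions, not the paper's evaluation code: the function names, the cosine-based nearest-neighbor success criterion, and the exclusion of the three query phones from the candidates are all hypothetical choices.

```python
import numpy as np


def center_pool(frames):
    """Return the single middle frame of a phone segment (frames: [T, D])."""
    return frames[len(frames) // 2]


def mean_pool(frames):
    """Average all frames of a phone segment."""
    return frames.mean(axis=0)


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy_succeeds(protos, a, b, c, target):
    """Assumed criterion: protos[a] - protos[b] + protos[c] must be closest
    (by cosine) to protos[target] among all phones except the query phones."""
    query = protos[a] - protos[b] + protos[c]
    candidates = [p for p in protos if p not in (a, b, c)]
    best = max(candidates, key=lambda p: cosine(query, protos[p]))
    return best == target


# Synthetic phone prototypes built by superposing random feature vectors.
rng = np.random.default_rng(0)
voiced, bilabial, alveolar, nasal = rng.standard_normal((4, 16))
protos = {
    "p": bilabial,
    "b": bilabial + voiced,
    "t": alveolar,
    "d": alveolar + voiced,
    "m": bilabial + voiced + nasal,
}

# [b] - [p] isolates the voicing vector; adding it to [t] should land on [d].
ok = analogy_succeeds(protos, "b", "p", "t", "d")

# Pooling a toy 3-frame segment two ways.
segment = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
center, mean = center_pool(segment), mean_pool(segment)
```

In a real setting, the prototypes would be computed from pooled S3M representations of each phone, and the success rate would be the fraction of analogy quadruples for which the check passes.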
