Self-Supervised Models of Speech Infer Universal Articulatory Kinematics

Cheol Jun Cho, Abdelrahman Mohamed, Alan W Black, Gopala K. Anumanchipalli

TL;DR

Self-supervised speech models encode articulatory kinematics as a causal intermediate across languages. By probing models trained on multiple languages with a large EMA dataset, the authors show that a simple linear projection can recover articulatory trajectories with average correlations above $0.8$, independent of training language. They further demonstrate that individual articulatory subsystems are affine-transformable across speakers and languages, implying a canonical basis of articulatory kinematics embedded in SSL representations. The results support language-agnostic, interpretable Acoustic-to-Articulatory Inversion models and reduce EMA data requirements for exploring articulatory phonology.

Abstract

Self-Supervised Learning (SSL) based models of speech have shown remarkable performance on a range of downstream tasks. These state-of-the-art models have remained black boxes, but many recent studies have begun "probing" models like HuBERT to correlate their internal representations with different aspects of speech. In this paper, we show "inference of articulatory kinematics" as a fundamental property of SSL models, i.e., the ability of these models to transform acoustics into the causal articulatory dynamics underlying the speech signal. We also show that this abstraction largely overlaps across the languages of the data used to train the model, with a preference for languages with similar phonological systems. Furthermore, we show that with simple affine transformations, Acoustic-to-Articulatory Inversion (AAI) is transferable across speakers, even across genders, languages, and dialects, showing the generalizability of this property. Together, these results shed new light on the internals of SSL models that are critical to their superior performance, and open up new avenues into language-agnostic universal models for speech engineering that are interpretable and grounded in speech science.
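
The linear probing setup described above can be illustrated with a minimal sketch. It assumes frame-level SSL features and time-aligned EMA trajectories are already available as arrays; here both are replaced by synthetic data, so the resulting numbers say nothing about real models. The ridge penalty, the train/test split, and every name (`fit_linear_probe`, `channel_correlations`, `feat_dim`) are illustrative assumptions, not details taken from the paper. Only the 12 EMA channels and the use of a correlation averaged over channels come from the text.

```python
import numpy as np


def fit_linear_probe(features, ema, ridge=1e-3):
    """Ridge-regularized least squares: features @ weights + bias ~ ema."""
    mu_x = features.mean(axis=0)
    mu_y = ema.mean(axis=0)
    xc = features - mu_x
    yc = ema - mu_y
    # Closed-form solution of the regularized normal equations.
    gram = xc.T @ xc + ridge * np.eye(xc.shape[1])
    weights = np.linalg.solve(gram, xc.T @ yc)
    bias = mu_y - mu_x @ weights
    return weights, bias


def channel_correlations(pred, target):
    """Pearson correlation between prediction and target, per EMA channel."""
    pc = pred - pred.mean(axis=0)
    tc = target - target.mean(axis=0)
    num = (pc * tc).sum(axis=0)
    den = np.sqrt((pc ** 2).sum(axis=0) * (tc ** 2).sum(axis=0))
    return num / den


# Synthetic stand-ins (assumption): real inputs would be SSL features
# from one model layer and EMA sensor traces resampled to the same frame rate.
rng = np.random.default_rng(0)
n_frames, feat_dim, n_channels = 2000, 64, 12
features = rng.standard_normal((n_frames, feat_dim))
true_map = rng.standard_normal((feat_dim, n_channels))
ema = features @ true_map + 0.5 * rng.standard_normal((n_frames, n_channels))

# Fit on the first 1500 frames, evaluate on the held-out remainder.
train, test = slice(0, 1500), slice(1500, None)
weights, bias = fit_linear_probe(features[train], ema[train])
pred = features[test] @ weights + bias

corr = channel_correlations(pred, ema[test])
mean_corr = float(corr.mean())  # score averaged across the 12 channels
print(f"mean correlation over {n_channels} channels: {mean_corr:.3f}")
```

Keeping the probe linear is the point of the design: a high held-out correlation then reflects information that is linearly accessible in the representation rather than capacity added by the probe itself.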

Paper Structure (14 sections, 3 figures, 1 table)

Figures (3)

  • Figure 1: (left) EMA prediction performance of probing SSL models from different languages. The correlations are averaged across 12 EMA channels. Each language-dialect group is denoted by a hatch pattern and each dot denotes an individual speaker. Regardless of the language, the average performance reaches over 0.8. (right) Performance comparison of English SSL versus Mandarin SSL; each panel shows a specific language-dialect group. The diagonal dashed lines denote identity lines. For English, native speakers (EN.UK/US) prefer English models over Mandarin models, with points scattered slightly above the diagonals, but speakers from China (EN.BJ/SH) show almost identical scores.
  • Figure 2: (far-left) Correlation matrix denoting transferabilities between language-dialect groups. The scores are averaged over possible pairs between groups. (mid-left) Distributions of correlations across dialects (blue) and within dialects (orange). The English speakers from China (EN.SH+EN.BJ) and the native English speakers from the US (EN.US) groups in the EMA-MAE dataset are used. (mid-right) Distributions of correlations across gender (blue) and within gender (orange). Four male and four female speakers from the HPRC dataset are used. (far-right) The average absolute coefficients in the affine transformations between the articulatory systems.
  • Figure 3: Transferability scores of each articulator averaged over all source-target pairs.
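
The affine transferability summarized in Figures 2 and 3 can be sketched as follows. This is one plausible reading rather than the authors' procedure: it assumes that trajectories in a source speaker's articulatory space and the target speaker's measured EMA are available for the same frames, fits a least-squares affine map between them, and scores the map by held-out correlation. The data are synthetic, the grouping of the 12 channels into six sensors with two coordinates each is an assumption, and all names (`fit_affine`, `pearson_per_channel`, `coupling`) are hypothetical.

```python
import numpy as np


def fit_affine(source, target):
    """Least-squares affine map: source @ matrix + offset ~ target."""
    # Append a column of ones so the offset is estimated jointly.
    design = np.hstack([source, np.ones((len(source), 1))])
    solution, *_ = np.linalg.lstsq(design, target, rcond=None)
    return solution[:-1], solution[-1]


def pearson_per_channel(pred, target):
    """Pearson correlation between prediction and target, per channel."""
    pc = pred - pred.mean(axis=0)
    tc = target - target.mean(axis=0)
    num = (pc * tc).sum(axis=0)
    den = np.sqrt((pc ** 2).sum(axis=0) * (tc ** 2).sum(axis=0))
    return num / den


# Synthetic stand-ins (assumption): the target speaker's space is built
# as a noisy affine function of the source speaker's space.
rng = np.random.default_rng(1)
n_frames, n_channels = 1000, 12
source_traj = rng.standard_normal((n_frames, n_channels))
mixing = np.eye(n_channels) + 0.1 * rng.standard_normal((n_channels, n_channels))
noise = 0.1 * rng.standard_normal((n_frames, n_channels))
target_traj = source_traj @ mixing + 0.3 + noise

# Fit the map on 800 frames and score it on the remaining 200.
fit, held_out = slice(0, 800), slice(800, None)
matrix, offset = fit_affine(source_traj[fit], target_traj[fit])
mapped = source_traj[held_out] @ matrix + offset

per_channel = pearson_per_channel(mapped, target_traj[held_out])
# Assumed layout: channels ordered as (x, y) pairs for six sensors.
per_articulator = per_channel.reshape(6, 2).mean(axis=1)
# Absolute coefficients show how strongly each source channel feeds each target channel.
coupling = np.abs(matrix)
print("transferability per articulator:", np.round(per_articulator, 3))
```

Averaging `per_articulator` over many source-target pairs would give a per-articulator score of the kind shown in Figure 3, and averaging `coupling` over pairs corresponds to the coefficient summary in the far-right panel of Figure 2.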