Simulating Articulatory Trajectories with Phonological Feature Interpolation

Angelo Ortiz Tandazo, Thomas Schatz, Thomas Hueber, Emmanuel Dupoux

TL;DR

This work studies the forward mapping from pseudo-motor commands to articulatory trajectories within a speech perception-production loop. It compares feature sets based on generative phonology (GP) and articulatory phonology (AP), and tests several interpolation strategies for generating smooth trajectories in these feature spaces. Evaluation uses linear probing to relate the interpolated trajectories to EMA-derived articulatory data from the MOCHA-TIMIT corpus. The maximum Pearson correlation reported is about $0.67$–$0.68$, obtained for GP features with one-hot phoneme encodings and linear interpolation. The study finds that linear interpolation captures articulatory dynamics better than cubic splines, that unknown/context-dependent features often improve the fit, and that incorporating under-specified dimensions can help model co-articulation. These results offer insights for integrating motor representations into self-supervised learning (SSL) speech models and suggest that under-specified phonological targets could be used to reflect biological motion. Future work is expected to explore why linear interpolation prevails and to extend the framework to more contexts.

Abstract

As a first step towards a complete computational model of speech learning involving perception-production loops, we investigate the forward mapping between pseudo-motor commands and articulatory trajectories. Two phonological feature sets, based respectively on generative and articulatory phonology, are used to encode a phonetic target sequence. Different interpolation techniques are compared to generate smooth trajectories in these feature spaces, with a potential optimisation of the target value and timing to capture co-articulation effects. We report the Pearson correlation between a linear projection of the generated trajectories and articulatory data derived from a multi-speaker dataset of electromagnetic articulography (EMA) recordings. A correlation of 0.67 is obtained with an extended feature set based on generative phonology and a linear interpolation technique. We discuss the implications of our results for our understanding of the dynamics of biological motion.
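
The pipeline described in the abstract (interpolate feature targets, project linearly, correlate with articulatory data) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature values, target times, function names, and the synthetic stand-in for EMA channels are all hypothetical, and the probe is a plain least-squares fit with a bias term.

```python
import numpy as np

# Hypothetical feature targets for a three-phone sequence (rows: phones,
# columns: made-up feature dimensions). These are NOT the paper's GP/AP sets.
targets = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
])
# Assumed target times in seconds (for example, phone midpoints).
target_times = np.array([0.05, 0.15, 0.30])


def interpolate_linear(times, values, frame_times):
    """Piecewise-linear interpolation of each feature dimension over time."""
    columns = [
        np.interp(frame_times, times, values[:, d])
        for d in range(values.shape[1])
    ]
    return np.stack(columns, axis=1)


def fit_linear_probe(x, y):
    """Least-squares linear map (with bias) from trajectories x to data y."""
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(x_aug, y, rcond=None)
    return weights


def apply_linear_probe(x, weights):
    """Apply a probe fitted by fit_linear_probe."""
    x_aug = np.hstack([x, np.ones((x.shape[0], 1))])
    return x_aug @ weights


def pearson_per_channel(pred, ref):
    """Pearson correlation between matching columns of pred and ref."""
    pred_c = pred - pred.mean(axis=0)
    ref_c = ref - ref.mean(axis=0)
    num = (pred_c * ref_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (ref_c ** 2).sum(axis=0))
    return num / den


frame_times = np.linspace(0.05, 0.30, 50)
trajectory = interpolate_linear(target_times, targets, frame_times)

# Synthetic stand-in for two EMA channels: a random linear mix plus noise.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(3, 2))
ema = trajectory @ mixing + 0.05 * rng.normal(size=(50, 2))

weights = fit_linear_probe(trajectory, ema)
correlations = pearson_per_channel(apply_linear_probe(trajectory, weights), ema)
print(correlations)
```

For brevity the sketch fits and scores the probe on the same frames; an actual evaluation would presumably fit the probe on one data split and report correlations on another.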

Paper Structure

This paper contains 9 sections, 4 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Simplified diagram of a speech perception-production loop (to the left). The focus of this work lies in the forward model and the linear probing (to the right).