ARTI-6: Towards Six-dimensional Articulatory Speech Encoding

Jihwan Lee, Sean Foley, Thanathai Lertpetchpun, Kevin Huang, Yoonjeong Lee, Tiantian Feng, Louis Goldstein, Dani Byrd, Shrikanth Narayanan

TL;DR

ARTI-6 addresses the need for an interpretable, low-dimensional articulatory representation by deriving a six-dimensional encoding from real-time MRI that covers six vocal-tract regions. It combines region selection with a foundation-model-based articulatory inversion (achieving $0.872$ correlation) and a HiFi-GAN–based articulatory synthesis system conditioned on speaker embeddings, demonstrating intelligible speech from compact articulatory features. The results show strong inversion performance for several ROIs and competitive intelligibility metrics (WER $0.125$, CER $0.074$, MOS $\approx 3.95$) on LibriTTS-R, validating the approach while highlighting trade-offs in naturalness. This framework offers interpretability, efficiency, and broad applicability for scientific studies and on-device speech technologies, with public code and samples to foster further development.
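For readers unfamiliar with the intelligibility metrics quoted above, the following is a minimal sketch of how WER and CER are conventionally defined: the Levenshtein edit distance between a reference transcript and a recognizer's hypothesis, divided by the reference length in words or characters. The function names are illustrative only. The paper's ASR system and text normalization are not stated in this summary, so this sketch assumes plain whitespace tokenization and no normalization, and may differ from the authors' evaluation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit-cost insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]  # deleting i reference tokens reaches the empty hypothesis
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of r
                cur[j - 1] + 1,          # insertion of h
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes whitespace tokenization (an assumption of this sketch)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over the raw strings, spaces included."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Under this definition, a WER of $0.125$ corresponds to roughly one word error per eight reference words.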

Abstract

We propose ARTI-6, a compact six-dimensional articulatory speech encoding framework derived from real-time MRI data that captures crucial vocal tract regions including the velum, tongue root, and larynx. ARTI-6 consists of three components: (1) a six-dimensional articulatory feature set representing key regions of the vocal tract; (2) an articulatory inversion model, which predicts articulatory features from speech acoustics leveraging speech foundation models, achieving a prediction correlation of 0.87; and (3) an articulatory synthesis model, which reconstructs intelligible speech directly from articulatory features, showing that even a low-dimensional representation can generate natural-sounding speech. Together, ARTI-6 provides an interpretable, computationally efficient, and physiologically grounded framework for advancing articulatory inversion, synthesis, and broader speech technology applications. The source code and speech samples are publicly available.
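The prediction correlation reported for the inversion model can be illustrated with a short sketch. It assumes the predicted and target features are arrays of shape `(frames, 6)`, that the metric is the Pearson correlation computed per articulatory dimension, and that the headline number is the mean over the six dimensions. These are assumptions of this sketch and are not confirmed by the abstract. The ROI abbreviations follow the Figure 1 caption; the function names are hypothetical.

```python
import numpy as np

# ROI order is an assumption; abbreviations are taken from the Figure 1 caption.
ROIS = ["LA", "TT", "TB", "VL", "TR", "LX"]


def per_roi_correlation(pred, target):
    """Pearson correlation over time for each of the six articulatory dimensions."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.ndim != 2 or pred.shape != target.shape or pred.shape[1] != len(ROIS):
        raise ValueError("expected two arrays of identical shape (frames, 6)")
    # Center each dimension along the time axis.
    pc = pred - pred.mean(axis=0)
    tc = target - target.mean(axis=0)
    # Covariance divided by the product of standard deviations, per column.
    denom = np.sqrt((pc ** 2).sum(axis=0) * (tc ** 2).sum(axis=0))
    r = (pc * tc).sum(axis=0) / denom
    return dict(zip(ROIS, r.tolist()))


def mean_correlation(pred, target):
    """Average of the per-ROI correlations (assumed aggregation, not confirmed)."""
    return float(np.mean(list(per_roi_correlation(pred, target).values())))
```

Keeping the per-ROI values separate, rather than reporting only the mean, makes it possible to see which regions are predicted less accurately, such as the velum and larynx regions mentioned in the Figure 2 caption.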

Paper Structure

This paper contains 15 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of the ARTI-6 framework. The articulatory inversion and synthesis models, and the six key regions of interest (ROIs) of the six-dimensional articulatory features (right): Lip Aperture (LA), Tongue Tip (TT), Tongue Body (TB), Velum (VL), Tongue Root (TR), and Larynx (LX).
  • Figure 2: An example utterance of predicted and target articulatory features of ARTI-6. High prediction accuracy is achieved for the lip and tongue regions, whereas performance is comparatively lower for the velum and larynx regions.
  • Figure 3: Heatmap of prediction correlations with respect to the vocal tract ROIs and phonetic manner categories.