ARTI-6: Towards Six-dimensional Articulatory Speech Encoding

Jihwan Lee; Sean Foley; Thanathai Lertpetchpun; Kevin Huang; Yoonjeong Lee; Tiantian Feng; Louis Goldstein; Dani Byrd; Shrikanth Narayanan

TL;DR

ARTI-6 addresses the need for an interpretable, low-dimensional articulatory representation by deriving a six-dimensional encoding from real-time MRI, with one dimension for each of six vocal-tract regions of interest (ROIs). It combines region selection with a foundation-model-based articulatory inversion model (prediction correlation of $0.872$) and a HiFi-GAN-based articulatory synthesis system conditioned on speaker embeddings, demonstrating that intelligible speech can be generated from compact articulatory features. On LibriTTS-R, the results show strong inversion performance for several ROIs, competitive intelligibility (WER $0.125$, CER $0.074$), and a mean opinion score (MOS) of $\approx 3.95$, validating the approach while highlighting trade-offs in naturalness. The framework offers interpretability, efficiency, and broad applicability for scientific studies and on-device speech technologies, and its code and speech samples are public to foster further development.

Abstract

We propose ARTI-6, a compact six-dimensional articulatory speech encoding framework derived from real-time MRI data that captures crucial vocal tract regions including the velum, tongue root, and larynx. ARTI-6 consists of three components: (1) a six-dimensional articulatory feature set representing key regions of the vocal tract; (2) an articulatory inversion model, which predicts articulatory features from speech acoustics by leveraging speech foundation models, achieving a prediction correlation of 0.87; and (3) an articulatory synthesis model, which reconstructs intelligible speech directly from articulatory features, showing that even a low-dimensional representation can generate natural-sounding speech. Together, these components make ARTI-6 an interpretable, computationally efficient, and physiologically grounded framework for advancing articulatory inversion, synthesis, and broader speech technology applications. The source code and speech samples are publicly available.
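
To make the reported prediction correlation concrete, the sketch below shows one common way to score an inversion model: the Pearson correlation between predicted and reference trajectories, computed separately for each of the six articulatory dimensions and then averaged. This is an illustration under stated assumptions, not the authors' evaluation code. The function names, the synthetic data, and the choice to average across dimensions are all hypothetical, since this page does not state how the 0.87 figure is aggregated.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def per_dimension_correlation(pred, ref):
    """Correlate each articulatory dimension over time.

    pred, ref: lists of frames, each frame a list of D feature values.
    Returns a list of D correlation coefficients.
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same number of frames")
    dims = len(ref[0])
    return [
        pearson([frame[d] for frame in pred], [frame[d] for frame in ref])
        for d in range(dims)
    ]


# Synthetic stand-in for 20 frames of six-dimensional features (not real data).
ref = [[math.sin(0.5 * t + d) for d in range(6)] for t in range(20)]
# A "prediction" that is a scaled, shifted copy of the reference.
pred = [[2.0 * v + 0.1 for v in frame] for frame in ref]

scores = per_dimension_correlation(pred, ref)
mean_score = sum(scores) / len(scores)  # assumed aggregation: mean over dimensions
```

Because Pearson correlation is invariant to positive scaling and offset, the synthetic prediction above scores essentially 1.0 on every dimension. A high correlation therefore indicates that the predicted trajectory has the right shape, not that its absolute values match the reference.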
