Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech

Hong Nguyen, Sean Foley, Kevin Huang, Xuan Shi, Tiantian Feng, Shrikanth Narayanan

TL;DR

This work introduces a data-driven method that visually represents articulator motion in MRI videos of the human vocal tract during speech, conditioned on arbitrary audio or speech input, and demonstrates that the visual generation benefits significantly from pre-trained speech representations.

Abstract

Understanding speech production both visually and kinematically can inform the design of second-language learning systems, as well as the creation of speaking characters in video games and animations. In this work, we introduce a data-driven method to visually represent articulator motion in Magnetic Resonance Imaging (MRI) videos of the human vocal tract during speech based on arbitrary audio or speech input. We leverage large pre-trained speech models, which are embedded with prior knowledge, to generalize the visual domain to unseen data using a speech-to-video diffusion model. Our findings demonstrate that the visual generation benefits significantly from the pre-trained speech representations. We also observe that phonemes are challenging to evaluate in isolation but more straightforward to assess within the context of spoken words. Limitations of the current results include non-smooth tongue motion and video distortion when the tongue contacts the palate.

Paper Structure

This paper contains 17 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The speech chain from higher-level linguistic representations to acoustic output. Our focus in this work is on low-level articulation, with the aim of generating vocal tract movements conditioned on acoustic prompts.
  • Figure 2: Overview of our Speech2rtMRI diffusion modeling framework for generating vocal tract movement video during speech. The framework includes two main phases: training and sampling.
  • Figure 3: Example cases of video quality degradation during generation. The left image shows inauthentic tongue shapes, while the middle and right images show points of tongue-palate contact before quality degradation.