Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech

Hong Nguyen; Sean Foley; Kevin Huang; Xuan Shi; Tiantian Feng; Shrikanth Narayanan

TL;DR

This work introduces a data-driven method to visually represent articulator motion in MRI videos of the human vocal tract during speech based on arbitrary audio or speech input and demonstrates that the visual generation significantly benefits from the pre-trained speech representations.

Abstract

Understanding speech production both visually and kinematically can inform the design of second-language learning systems, as well as the creation of speaking characters in video games and animations. In this work, we introduce a data-driven method to visually represent articulator motion in Magnetic Resonance Imaging (MRI) videos of the human vocal tract during speech based on arbitrary audio or speech input. We leverage large pre-trained speech models, which are embedded with prior knowledge, to generalize the visual domain to unseen data using a speech-to-video diffusion model. Our findings demonstrate that the visual generation significantly benefits from the pre-trained speech representations. We also observe that generated phonemes are challenging to evaluate in isolation but become more straightforward to assess within the context of spoken words. Limitations of the current results include non-smooth tongue motion and video distortion when the tongue contacts the palate.

Paper Structure (17 sections, 3 figures, 2 tables)

Introduction
Speech-to-rtMRI Diffusion Model
Overview
Training
Sampling
Experiments
Dataset
Experiment Settings
Evaluation Metrics
Motion fidelity is as important as visual fidelity in medical domain
Human evaluation
Discussion
Synthetic phonemes are harder to assess than words
Automatic Synthetic phoneme/word/sentence evaluation metric
Differential weighting of the articulators
...and 2 more sections

Figures (3)

Figure 1: The speech chain from higher-level linguistic representations to acoustic output. Our focus in this work is on low-level articulation, with the aim of generating vocal tract movements conditioned on acoustic prompts.
Figure 2: Overview of our speech-2-rtMRI Diffusion modeling framework for generating vocal tract movement video during speech. Our modeling framework includes two main phases: training and sampling.
Figure 3: Example cases of video quality degradation during generation. The left image shows inauthentic tongue shapes, while the middle and right images show points of tongue-palate contact before quality degradation.
