Speech Audio Generation from dynamic MRI via a Knowledge Enhanced Conditional Variational Autoencoder

Yaxuan Li; Han Jiang; Yifei Ma; Shihua Qin; Jonghye Woo; Fangxu Xing

TL;DR

Dynamic MRI of the vocal tract is a powerful tool for speech studies, but audio capture during scanning is hampered by scanner noise and data corruption. The authors present KE-CVAE, a two-step framework that first performs knowledge enhancement on unlabeled MRI data using a teacher–student ViT with self-supervised losses, then trains a conditional variational autoencoder to generate speech conditioned on the MRI sequence, aided by normalizing flows and adversarial training. The approach achieves higher Corr2D, PESQ, and MOS than CNN/Transformer baselines, and ablation results show the importance of each component. This work enables more accurate speech synthesis directly from dynamic MRI, with potential benefits for clinical diagnostics and speech motor research.

Abstract

Dynamic Magnetic Resonance Imaging (MRI) of the vocal tract has become an increasingly adopted imaging modality for speech motor studies. Beyond image signals, systematic data loss, noise pollution, and audio file corruption can occur due to the unpredictability of the MRI acquisition environment. In such cases, generating audio from images is critical for data recovery in both clinical and research applications. However, this remains challenging due to hardware constraints, acoustic interference, and data corruption. Existing solutions, such as denoising and multi-stage synthesis methods, face limitations in audio fidelity and generalizability. To address these challenges, we propose a Knowledge Enhanced Conditional Variational Autoencoder (KE-CVAE), a novel two-step "knowledge enhancement + variational inference" framework for generating speech audio signals from cine dynamic MRI sequences. This approach introduces two key innovations: (1) integration of unlabeled MRI data for knowledge enhancement, and (2) a variational inference architecture to improve generative modeling capacity. To the best of our knowledge, this is one of the first attempts at synthesizing speech audio directly from dynamic MRI video sequences. The proposed method was trained and evaluated on an open-source dynamic vocal tract MRI dataset recorded during speech. Experimental results demonstrate its effectiveness in generating natural speech waveforms while addressing MRI-specific acoustic challenges, outperforming conventional deep learning-based synthesis approaches.
