Acoustic-to-Articulatory Inversion of Clean Speech Using an MRI-Trained Model

Sofiane Azzouz, Pierre-André Vuissoz, Yves Laprie

Abstract

Acoustic-to-articulatory inversion reconstructs vocal tract shapes from the speech signal. Real-time magnetic resonance imaging (rt-MRI) allows simultaneous acquisition of the acoustic speech signal and articulatory information. However, rt-MRI acquisition is complex, and the recorded audio is heavily corrupted by scanner noise and must be denoised to be usable. For practical use, it must be possible to invert speech recorded without MRI noise. In this study, we investigate speech recorded in a clean acoustic environment as an alternative to denoised MRI speech. To this end, we compare two recordings of the same sentences by the same speaker, aligned using phonetic segmentation. A model trained on denoised MRI speech is evaluated on both denoised MRI speech and clean speech. We also assess a model trained and tested only on clean speech. Results show that clean speech supports articulatory inversion effectively, achieving an RMSE of 1.56 mm, close to MRI-based performance.
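As an illustration of the error measure quoted in the abstract, the sketch below computes a root-mean-square error in millimetres between predicted and reference articulator contours. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the function name `contour_rmse`, the array layout (frames, contour points, x/y coordinates), and the choice to average over all coordinates are hypothetical, since the abstract does not specify how the RMSE is aggregated.

```python
import numpy as np


def contour_rmse(pred, ref):
    """RMSE between predicted and reference contours.

    Assumed layout (hypothetical): arrays of shape
    (n_frames, n_points, 2) holding x/y coordinates in mm.
    The error is averaged over every coordinate of every frame.
    """
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    if pred.shape != ref.shape:
        raise ValueError("pred and ref must have the same shape")
    return float(np.sqrt(np.mean((pred - ref) ** 2)))


if __name__ == "__main__":
    # Toy data: 4 frames, 50 contour points, reference shifted by a
    # constant 3 mm on both axes (illustrative values only).
    ref = np.zeros((4, 50, 2))
    pred = ref + 3.0
    print(f"RMSE: {contour_rmse(pred, ref):.2f} mm")
```

Averaging over all coordinates gives a single figure per test set; a per-articulator breakdown would apply the same function to each articulator's points separately.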

Paper Structure (14 sections, 7 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Comparison of normal speech, MRI-denoised speech from our dataset, and denoised speech from ramanarayanan2018analysis. The first two signals were produced by the same female speaker uttering “Après une heure.” All three audio files are provided as supplementary material.
  • Figure 2: Contours of the articulators tracked in two images of the rt-MRI film: arytenoid cartilage, epiglottis, lower lip, pharyngeal wall, soft palate midline, tongue, upper lip, and vocal folds.
  • Figure 3: Model architecture