Acoustic-to-Articulatory Inversion of Clean Speech Using an MRI-Trained Model

Sofiane Azzouz; Pierre-André Vuissoz; Yves Laprie

Abstract

Acoustic-to-articulatory inversion reconstructs vocal tract shapes from the speech signal. Real-time magnetic resonance imaging (rt-MRI) allows simultaneous acquisition of the acoustic speech signal and articulatory information. However, rt-MRI acquisition is complex, and the recorded audio is heavily corrupted by scanner noise and must be denoised to be usable. For practical use, it must be possible to invert speech recorded without MRI noise. In this study, we investigate the use of speech recorded in a clean acoustic environment as an alternative to denoised MRI speech. To this end, we compare two recordings of identical sentences by the same speaker, aligned using phonetic segmentation. A model trained on denoised MRI speech is evaluated on both denoised MRI speech and clean speech. We also assess a model trained and tested only on clean speech. Results show that clean speech supports articulatory inversion effectively, achieving an RMSE of 1.56 mm, close to MRI-based performance.
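As an illustration of the two operations named in the abstract, the following minimal sketch shows one way to carry a time stamp from the clean recording onto the MRI recording using shared phone boundaries, and one common way to compute an RMSE in millimetres over contour points. The function names `map_time` and `rmse_mm`, the piecewise-linear mapping inside each phone, and the point-wise RMSE convention are assumptions made for this example; the abstract does not specify how the alignment or the metric is actually implemented.

```python
import numpy as np


def map_time(t_clean, bounds_clean, bounds_mri):
    """Map times in the clean recording onto the MRI recording's time axis.

    bounds_clean and bounds_mri hold the phone boundary times (seconds) of
    the same phone sequence in each recording, so they have equal length.
    Assumption: time is warped linearly inside each phone.
    """
    bounds_clean = np.asarray(bounds_clean, dtype=float)
    bounds_mri = np.asarray(bounds_mri, dtype=float)
    if bounds_clean.shape != bounds_mri.shape:
        raise ValueError("both recordings must share the same phone sequence")
    if np.any(np.diff(bounds_clean) <= 0):
        raise ValueError("phone boundaries must be strictly increasing")
    # np.interp performs the piecewise-linear lookup between boundaries.
    return np.interp(t_clean, bounds_clean, bounds_mri)


def rmse_mm(pred, ref):
    """RMSE over contour points, in the unit of the inputs (here mm).

    pred and ref have shape (..., n_points, 2). Assumption: the error of a
    point is its Euclidean distance to the reference point.
    """
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    sq_dist = np.sum((pred - ref) ** 2, axis=-1)
    return float(np.sqrt(np.mean(sq_dist)))


# Hypothetical boundaries for a two-phone utterance in each recording.
clean_bounds = [0.0, 0.1, 0.3]
mri_bounds = [0.0, 0.2, 0.5]
mapped = map_time([0.05, 0.2], clean_bounds, mri_bounds)
```

With such a mapping, each acoustic frame of the clean recording can be paired with the articulatory contours of the rt-MRI frame nearest to the mapped time, which is what allows a model to be trained or evaluated on clean speech.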

Paper Structure (14 sections, 7 equations, 3 figures, 2 tables)

Introduction
Impact of MRI noise and denoising
Dataset
Preprocessing
Methods
Alignment of MRI and Clean Speech Corpora
Model Architecture
Loss function
Evaluation of the model
Experiments
Model parameters
Results
Discussion
Conclusion

Figures (3)

Figure 1: Comparison of normal speech, MRI-denoised speech from our dataset, and denoised speech from Ramanarayanan et al. (2018). The first two signals were produced by the same female speaker uttering “Après une heure.” All three audio files are provided as supplementary material.
Figure 2: Articulator contours segmented and tracked in two images of the rt-MRI film: arytenoid cartilage, epiglottis, lower lip, pharyngeal wall, soft palate midline, tongue, upper lip, and vocal folds.
Figure 3: Model architecture
