Acoustic to Articulatory Inversion of Speech; Data Driven Approaches, Challenges, Applications, and Future Scope

Leena G Pillai, D. Muhammad Noorul Mubarak

TL;DR

This paper surveys data-driven Acoustic-to-Articulatory Inversion (AAI) approaches developed between 2011 and 2021. It covers data sources including EMA, MRI, ultrasound, EPG, and X-ray cineradiography, and outlines the end-to-end workflow of feature extraction (e.g., MFCCs) and articulatory feature representations (e.g., Tract Variables), followed by training and evaluation using a spectrum of generative and discriminative methods. The review reports common evaluation metrics such as correlation coefficient, RMSE, MSE, and MFE, and highlights applications in speech analysis, synthesis, ASR incorporation, and phonetic/therapeutic feedback systems. It underscores the advantages of deep learning in improving articulatory reconstruction and the potential for interpretable, 3D articulatory feedback for therapy and language training. The authors also discuss practical data collection challenges and call for larger, accessible parallel datasets to advance SI-AAI and clinical deployments.

Abstract

This review focuses on the data-driven approaches applied in different applications of Acoustic-to-Articulatory Inversion (AAI) of speech. It considers the relevant works published in the last ten years (2011-2021). The selection criteria include (a) type of AAI: Speaker-Dependent and Speaker-Independent AAI; (b) objectives of the work: articulatory approximation, articulatory feature space selection and Automatic Speech Recognition (ASR), exploring the correlation between acoustic and articulatory features, and a framework for computer-assisted language training; (c) corpus: simultaneously recorded speech (wav) and medical imaging modalities such as ElectroMagnetic Articulography (EMA), Electropalatography (EPG), Laryngography, Electroglottography (EGG), X-ray Cineradiography, Ultrasound, and real-time Magnetic Resonance Imaging (rtMRI); (d) methods or models: only recent works are considered, and therefore all the works are based on machine learning; (e) evaluation: as AAI is a non-linear regression problem, performance is mostly evaluated by Correlation Coefficient (CC) and Root Mean Square Error (RMSE), with Mean Square Error (MSE) and Mean Format Error (MFE) also considered. In practice, an AAI model can provide a user-friendly, interpretable image feedback system of articulatory positions, especially tongue movement. Such a trajectory feedback system can be used to provide phonetic, language, and speech therapy for pathological subjects.
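As an illustration of the two evaluation metrics the abstract names most often, the following minimal sketch computes RMSE and the Pearson correlation coefficient between a reference and a predicted articulatory trajectory. The function names, the sample values, and the choice of a single tongue-tip channel in millimetres are assumptions made for this example; they are not taken from the reviewed works, which may normalize or average over articulators differently.

```python
import math


def rmse(reference, predicted):
    """Root mean square error between two equal-length trajectories."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def correlation_coefficient(reference, predicted):
    """Pearson correlation coefficient between two equal-length trajectories."""
    if len(reference) != len(predicted) or len(reference) < 2:
        raise ValueError("trajectories must have equal length of at least 2")
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    covariance = sum((r - mean_r) * (p - mean_p)
                     for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_r == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant trajectory")
    return covariance / math.sqrt(var_r * var_p)


# Hypothetical tongue-tip height samples (mm); illustrative values only.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.5, 3.5, 4.5, 5.5]

print("RMSE:", rmse(reference, predicted))
print("CC:  ", correlation_coefficient(reference, predicted))
```

In this example the prediction is offset from the reference by a constant 0.5 mm, so the RMSE is 0.5 while the correlation is essentially 1. This shows why the two metrics are usually reported together: CC captures whether the trajectory shape is recovered, and RMSE captures the absolute positional error.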

Paper Structure

This paper contains 9 sections, 8 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: From the acoustic features of the speech wave, AAI estimates how the speech sounds are produced (place and manner of articulation).
  • Figure 2: Tract Variables [Sivaraman2019]