Improving Speech Inversion Through Self-Supervised Embeddings and Enhanced Tract Variables

Ahmed Adel Attia; Yashish M. Siriwardena; Carol Espy-Wilson

TL;DR

The paper improves acoustic-to-articulatory speech inversion (SI) by pairing self-supervised HuBERT features with a novel geometric tract-variable (TV) transformation whose outputs correlate more closely with the acoustics. Combining SSL inputs with the enhanced TVs raises TV-estimation accuracy, measured by the Pearson product-moment correlation (PPMC), from 0.7452 to 0.8141. The system uses HuBERT-large embeddings extracted from 2-second segments within a BiGRNN SI framework and is compared against MFCC baselines; the results indicate that the input representation matters more than data quantity alone. The findings underscore the importance of rich input representations and output feature-space design for robust articulatory estimation, and point to future work extending the TV model to capture more complete tongue dynamics.

Abstract

The performance of deep learning models depends significantly on their capacity to encode input features efficiently and decode them into meaningful outputs. Better input and output representations have the potential to boost a model's performance and generalization. In the context of acoustic-to-articulatory speech inversion (SI) systems, we study the impact of using speech representations acquired via self-supervised learning (SSL) models, such as HuBERT, compared to conventional acoustic features. Additionally, we investigate the incorporation of novel tract variables (TVs) through an improved geometric transformation model. By combining these two approaches, we improve the Pearson product-moment correlation (PPMC) scores, which evaluate the accuracy of the SI system's TV estimation, from 0.7452 to 0.8141, an absolute gain of 0.0689 (6.9 points, or about 9.2% relative). Our findings underscore the strong influence that rich feature representations from SSL models and improved geometric transformations of the target TVs have on SI performance.
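The PPMC metric reported in the abstract can be illustrated with a minimal sketch. The helper names `ppmc` and `mean_ppmc`, the `(frames, num_tvs)` array layout, and the choice to average the score across TVs are assumptions made for illustration; the text only states that PPMC evaluates TV-estimation accuracy. The last lines work through the arithmetic behind the reported scores.

```python
import numpy as np


def ppmc(y_true, y_pred):
    """Pearson product-moment correlation between two 1-D trajectories."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Center both trajectories around their means.
    d_true = y_true - y_true.mean()
    d_pred = y_pred - y_pred.mean()
    denom = np.sqrt((d_true ** 2).sum() * (d_pred ** 2).sum())
    if denom == 0.0:
        # Correlation is undefined when either trajectory is constant.
        return float("nan")
    return float((d_true * d_pred).sum() / denom)


def mean_ppmc(tv_true, tv_pred):
    """Average the per-TV PPMC over columns of (frames, num_tvs) arrays.

    Averaging across TVs is an assumption of this sketch, not a detail
    confirmed by the text.
    """
    tv_true = np.asarray(tv_true, dtype=float)
    tv_pred = np.asarray(tv_pred, dtype=float)
    scores = [ppmc(tv_true[:, k], tv_pred[:, k]) for k in range(tv_true.shape[1])]
    return float(np.mean(scores))


# Arithmetic on the two scores reported in the abstract.
baseline, improved = 0.7452, 0.8141
absolute_gain = improved - baseline       # difference in correlation
relative_gain = absolute_gain / baseline  # gain relative to the baseline
```

Here `absolute_gain` is the difference between the two reported correlations, which is what the "6.9%" figure corresponds to when read as percentage points, while `relative_gain` expresses the same improvement as a fraction of the baseline score.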

Paper Structure (16 sections, 6 equations, 3 figures, 1 table)

Introduction
Articulatory Dataset
Novel Tract Variable Transformations
Articulatory Model
Lips
Tongue Body
Tongue Tip
Speech Inversion Model Architectures
SI Architecture with HuBERT features
SI Architecture with MFCC features
Model Training
Results
TV Transformations
SSL features with new TV transformations
Estimated TVs with best performing SI systems
...and 1 more sections

Figures (3)

Figure 1: Pellet placement and TV definition in the XRMB dataset
Figure 2: Extended Palatal Trace with the Anterior Pharyngeal Wall for Speaker JW33
Figure 3: LA and constriction degree TVs for the utterance ‘The dormitory is between the house and the school’ estimated by the model trained with HuBERT embeddings (estimated_hubert) and the model trained with MFCCs (estimated_mfcc). Solid blue line: ground truth; black dotted line: predictions by the HuBERT-based model; yellow dotted line: predictions by the MFCC-based model.
