Improving Speech Inversion Through Self-Supervised Embeddings and Enhanced Tract Variables

Ahmed Adel Attia, Yashish M. Siriwardena, Carol Espy-Wilson

TL;DR

The paper improves acoustic-to-articulatory speech inversion (SI) by pairing self-supervised HuBERT features with a novel geometric tract-variable (TV) transformation that produces output representations more closely correlated with the acoustics. Combining the SSL inputs with the enhanced TVs raises TV-estimation accuracy, measured by PPMC, from 0.7452 to 0.8141. The system extracts HuBERT-large embeddings from 2-second segments, feeds them to a BiGRNN SI model, and is compared against MFCC baselines; the comparison indicates that input representation matters more than data quantity alone. The findings underscore the importance of rich input representations and output feature-space design for robust articulatory estimation, and point to future work extending the TV model to capture more complete tongue dynamics.
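The summary above mentions 2-second input segments but does not say how they are cut. The sketch below shows one plausible way to do it, not the authors' pipeline. The 16 kHz sample rate, the non-overlapping windows, the zero-padding of the last segment, and the function name `split_into_segments` are all assumptions made for illustration.

```python
def split_into_segments(samples, sample_rate=16000, segment_seconds=2.0):
    """Split a 1-D waveform into fixed-length, non-overlapping segments.

    Assumptions (not stated in the source): 16 kHz audio, no overlap,
    and zero-padding of the final, shorter segment.
    """
    seg_len = int(round(sample_rate * segment_seconds))
    if seg_len <= 0:
        raise ValueError("segment length must be positive")

    segments = []
    for start in range(0, len(samples), seg_len):
        chunk = list(samples[start:start + seg_len])
        # Pad the last chunk with zeros so every segment has equal length.
        chunk.extend([0.0] * (seg_len - len(chunk)))
        segments.append(chunk)
    return segments


if __name__ == "__main__":
    # 5 seconds of silence at 16 kHz yields three 2-second segments.
    waveform = [0.0] * (5 * 16000)
    segs = split_into_segments(waveform)
    print(len(segs), len(segs[0]))
```

Each fixed-length segment would then be passed to the feature extractor (HuBERT-large or MFCCs) before reaching the SI model.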

Abstract

The performance of deep learning models depends significantly on their capacity to encode input features efficiently and decode them into meaningful outputs. Better input and output representations have the potential to boost models' performance and generalization. In the context of acoustic-to-articulatory speech inversion (SI) systems, we study the impact of utilizing speech representations acquired via self-supervised learning (SSL) models, such as HuBERT, compared to conventional acoustic features. Additionally, we investigate the incorporation of novel tract variables (TVs) through an improved geometric transformation model. By combining these two approaches, we improve the Pearson product-moment correlation (PPMC) scores, which evaluate the accuracy of TV estimation of the SI system, from 0.7452 to 0.8141, an absolute increase of 0.0689 (about 9.2% relative). Our findings underscore the profound influence of rich feature representations from SSL models and improved geometric transformations with target TVs on the enhanced functionality of SI systems.
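For readers unfamiliar with the metric, here is a minimal sketch of how a PPMC score can be computed between a ground-truth and an estimated TV trajectory. The abstract does not describe the evaluation procedure, so averaging the per-TV scores into one number is an assumption. The helper names `ppmc` and `mean_ppmc` and the dictionary keys in the example are hypothetical.

```python
import math


def ppmc(y_true, y_pred):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((a - mean_t) * (b - mean_p) for a, b in zip(y_true, y_pred))
    var_t = sum((a - mean_t) ** 2 for a in y_true)
    var_p = sum((b - mean_p) ** 2 for b in y_pred)

    denom = math.sqrt(var_t * var_p)
    if denom == 0.0:
        # Correlation is undefined when either trajectory is constant.
        return float("nan")
    return cov / denom


def mean_ppmc(tv_true, tv_pred):
    """Average the per-TV PPMC over all TVs (assumed aggregation).

    Both arguments map a TV name to its trajectory (list of floats).
    """
    scores = {name: ppmc(tv_true[name], tv_pred[name]) for name in tv_true}
    return sum(scores.values()) / len(scores), scores


if __name__ == "__main__":
    # Toy trajectories; "LA" and "other_tv" are placeholder keys.
    truth = {"LA": [0.1, 0.4, 0.9, 0.5], "other_tv": [1.0, 0.8, 0.2, 0.3]}
    preds = {"LA": [0.2, 0.5, 0.8, 0.4], "other_tv": [0.9, 0.9, 0.1, 0.4]}
    average, per_tv = mean_ppmc(truth, preds)
    print(average, per_tv)
```

A score of 1 means the estimated trajectory tracks the ground truth perfectly up to scale and offset, which is why PPMC rewards getting the shape of the articulatory movement right rather than its absolute position.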

Paper Structure

This paper contains 16 sections, 6 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Pellet placement and TV definition in the XRMB dataset
  • Figure 2: Extended Palatal Trace With the Anterior Pharyngeal Wall for Speaker JW33
  • Figure 3: LA and constriction degree TVs for the utterance ‘The dormitory is between the house and the school’, estimated by the model trained with HuBERT embeddings (estimated_hubert) and the model trained with MFCCs (estimated_mfcc). Solid blue line - ground truth, black dotted line - predictions by the HuBERT-based model, yellow dotted line - predictions by the MFCC-based model.