Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture

Leena G Pillai, D. Muhammad Noorul Mubarak, Elizabeth Sherly

TL;DR

The paper tackles acoustic-to-articulatory inversion (AAI) by predicting tongue and lip movements from speech acoustics using a stacked BiLSTM-CNN architecture that incorporates a fixed-weight smoothing layer. Trained on parallel EMA datasets (MOCHA and USC-TIMIT), the approach is evaluated across speaker-dependent, speaker-independent, corpus-dependent, and cross-corpus settings, showing that fixed-weight smoothing yields superior performance in most scenarios and faster convergence. Key contributions include the integration of a Windowed-Sinc low-pass smoothing module with fixed weights and a comprehensive multi-mode evaluation, highlighting both robustness in intra-corpus tasks and challenges in cross-corpus generalization. The work advances articulatory feature prediction and offers a practical framework for AAI with potential implications for speech science, pathology assessment, and speech technology applications, while identifying domain adaptation needs for cross-corpus transfer.

Abstract

Speech production is a complex sequential process that involves the coordination of various articulatory features. Among them, the tongue is a highly versatile active articulator responsible for shaping airflow to produce targeted speech sounds that are intelligible, clear, and distinct. This paper presents a novel approach for predicting the tongue and lip articulatory features involved in given speech acoustics using a stacked Bidirectional Long Short-Term Memory (BiLSTM) architecture, combined with a one-dimensional Convolutional Neural Network (CNN) for post-processing with fixed-weight initialization. The proposed network is trained on two datasets of simultaneously recorded speech and Electromagnetic Articulography (EMA) data, each introducing variation in geographical origin, linguistic characteristics, phonetic diversity, and recording equipment. The performance of the model is assessed in Speaker Dependent (SD), Speaker Independent (SI), Corpus Dependent (CD), and Cross Corpus (CC) modes. Experimental results indicate that the proposed model with the fixed-weight approach outperformed adaptive-weight initialization within a relatively small number of training epochs. These findings contribute to the development of robust and efficient models for articulatory feature prediction, paving the way for advancements in speech production research and applications.
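
To make the idea of a fixed-weight low-pass smoothing layer more concrete, the following is a minimal NumPy sketch of a windowed-sinc filter applied to predicted articulatory trajectories. It is not the authors' implementation: the function names, the Hamming window, the 31-tap length, the 100 Hz frame rate, and the 10 Hz cutoff are all illustrative assumptions that do not come from the paper. In a neural network, the same kernel could be loaded into a one-dimensional convolution layer whose weights are kept frozen during training.

```python
import numpy as np


def windowed_sinc_kernel(cutoff_hz, sample_rate_hz, num_taps=31):
    """Build a low-pass FIR kernel with the windowed-sinc method.

    All parameter values are hypothetical; the paper's actual cutoff,
    window type, and kernel length are not given in this summary.
    """
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd so the kernel is symmetric")
    fc = cutoff_hz / sample_rate_hz              # cutoff in cycles per sample
    n = np.arange(num_taps) - (num_taps - 1) / 2  # sample offsets centred on 0
    kernel = 2.0 * fc * np.sinc(2.0 * fc * n)     # ideal low-pass impulse response
    kernel *= np.hamming(num_taps)                # taper to reduce ripple (assumed window)
    return kernel / kernel.sum()                  # unit gain at 0 Hz


def smooth_trajectories(trajectories, kernel):
    """Filter each articulatory channel with the same fixed kernel.

    trajectories: array of shape (frames, channels), e.g. EMA x/y coordinates.
    Edge padding keeps the output the same length as the input.
    """
    pad = len(kernel) // 2
    padded = np.pad(trajectories, ((pad, pad), (0, 0)), mode="edge")
    channels = [
        np.convolve(padded[:, c], kernel, mode="valid")
        for c in range(trajectories.shape[1])
    ]
    return np.stack(channels, axis=1)


# Illustrative use: a slow 2 Hz movement with added jitter, sampled at 100 Hz.
rng = np.random.default_rng(0)
t = np.arange(500) / 100.0
clean = np.sin(2 * np.pi * 2.0 * t)[:, None]
noisy = clean + 0.1 * rng.standard_normal(clean.shape)
kernel = windowed_sinc_kernel(cutoff_hz=10.0, sample_rate_hz=100.0)
smoothed = smooth_trajectories(noisy, kernel)
```

Because the kernel is computed once from the cutoff and never updated, the smoothing step adds no trainable parameters, which is one plausible reading of why a fixed-weight layer could converge in fewer epochs than a learned one.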

Paper Structure

This paper contains 15 sections, 9 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Proposed BiLSTM-CNN Architecture for AAI
  • Figure 2: $TT_x$ EMA sensor value of a US speaker for uttering "This was easy for us"
  • Figure 3: $TT_x$ EMA sensor value of a UK speaker for uttering "This was easy for us"
  • Figure 4: The evaluation of the model in SD approach
  • Figure 5: Performance evaluation of each speaker in SI approach
  • ...and 3 more figures