
Speaker-Independent Acoustic-to-Articulatory Inversion through Multi-Channel Attention Discriminator

Woo-Jin Chung, Hong-Goo Kang

TL;DR

This work leverages representations from a pre-trained self-supervised learning model to more effectively estimate global, local, and kinematic pattern information in Electromagnetic Articulography signals during acoustic-to-articulatory inversion (AAI), overcoming the limitations of conventional AAI models that rely on acoustic features derived from restricted datasets.

Abstract

We present a novel speaker-independent acoustic-to-articulatory inversion (AAI) model, overcoming the limitations observed in conventional AAI models that rely on acoustic features derived from restricted datasets. To address these challenges, we leverage representations from a pre-trained self-supervised learning (SSL) model to more effectively estimate the global, local, and kinematic pattern information in Electromagnetic Articulography (EMA) signals during the AAI process. We train our model using an adversarial approach and introduce an attention-based multi-duration phoneme discriminator (MDPD) designed to fully capture the intricate relationships among multi-channel articulatory signals. Our method achieves a Pearson correlation coefficient of 0.847, marking state-of-the-art performance among speaker-independent AAI models. The implementation details and code can be found online.
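The headline metric is the Pearson correlation coefficient (PCC) between predicted and measured EMA trajectories. As a point of reference, here is a minimal sketch of how a channel-averaged PCC is commonly computed for AAI evaluation; the function name `pearson_cc`, the `(T, C)` array layout, and the per-channel averaging scheme are illustrative assumptions, not the authors' evaluation code.

```python
import numpy as np

def pearson_cc(pred: np.ndarray, target: np.ndarray) -> float:
    """Channel-averaged Pearson correlation between EMA trajectories.

    pred, target: arrays of shape (T, C) holding predicted and
    ground-truth articulator positions (T frames, C EMA channels).
    """
    assert pred.shape == target.shape
    ccs = []
    for c in range(pred.shape[1]):
        x = pred[:, c] - pred[:, c].mean()
        y = target[:, c] - target[:, c].mean()
        # Dot product of centered signals over the product of their norms.
        ccs.append((x @ y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8))
    return float(np.mean(ccs))
```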

Paper Structure

This paper contains 10 sections, 7 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Illustration of the proposed model.
  • Figure 2: Illustration of PNP convolution module.
  • Figure 3: Illustration of depthwise PNP-convolution. Here, $a$ denotes the pre-fixed constant influencing the frequency range emphasized by the Snake activation function. DConv signifies the depthwise convolution operation. (A code sketch of this block follows the list.)
  • Figure 4: Illustration of the EMA reshaping process for sub-phoneme discriminators. The left figure illustrates the EMA signals with $C$ EMA traces and a length of $T$. The middle figure shows the EMA signals after the addition of channel split embeddings ($e_{cs}$) and the channel end embedding ($e_{ce}$). The right figure displays the EMA signals for the MDPD input, reshaped to have a channel size corresponding to the phoneme duration ($t_{pd}$) and a length of $C_D\cdot(T/t_{pd})$. (A code sketch of this reshaping also follows the list.)
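
The PNP-convolution internals are not spelled out in these captions, but the Snake activation referenced in Figure 3 has the standard closed form $x + \frac{1}{a}\sin^2(ax)$ with a pre-fixed constant $a$. Below is a minimal PyTorch sketch pairing it with a depthwise convolution (DConv); the module names and exact composition are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation: x + (1/a) * sin^2(a * x), with pre-fixed constant a."""
    def __init__(self, a: float = 1.0):
        super().__init__()
        self.a = a

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.sin(self.a * x) ** 2 / self.a

class DepthwiseSnakeConv(nn.Module):
    """Depthwise 1-D convolution followed by Snake, a stand-in for the
    DConv + Snake stage sketched in Figure 3 (composition assumed)."""
    def __init__(self, channels: int, kernel_size: int = 3, a: float = 1.0):
        super().__init__()
        # groups=channels makes the convolution depthwise (one filter per channel).
        self.dconv = nn.Conv1d(channels, channels, kernel_size,
                               padding=kernel_size // 2, groups=channels)
        self.act = Snake(a)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return self.act(self.dconv(x))
```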
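The Figure 4 caption pins down the input shape $(C, T)$ and output shape $(t_{pd}, C_D\cdot(T/t_{pd}))$, but not the exact embedding layout or fold order. The PyTorch sketch below is one consistent reading; `reshape_for_mdpd`, the $C_D = 2C$ channel layout, and the reshape/permute order are illustrative assumptions, not the authors' implementation.

```python
import torch

def reshape_for_mdpd(ema: torch.Tensor, e_cs: torch.Tensor,
                     e_ce: torch.Tensor, t_pd: int) -> torch.Tensor:
    """Prepare EMA signals for the MDPD input, following Figure 4.

    ema : (C, T) tensor of C articulatory traces of length T.
    e_cs: (1, T) channel split embedding placed between adjacent traces.
    e_ce: (1, T) channel end embedding appended after the last trace.
    t_pd: sub-phoneme duration used as the new channel size.
    """
    C, T = ema.shape
    assert T % t_pd == 0, "T must be divisible by the phoneme duration"

    # Interleave split embeddings between traces, then append the end
    # embedding, yielding C_D = 2 * C channels (layout assumed).
    rows = []
    for c in range(C):
        rows.append(ema[c:c + 1])
        rows.append(e_cs if c < C - 1 else e_ce)
    x = torch.cat(rows, dim=0)                  # (C_D, T)
    c_d = x.shape[0]

    # Fold time so each chunk of t_pd samples becomes a channel:
    # (C_D, T) -> (C_D, T/t_pd, t_pd) -> (t_pd, C_D * (T/t_pd)).
    x = x.view(c_d, T // t_pd, t_pd).permute(2, 0, 1)
    return x.reshape(t_pd, c_d * (T // t_pd))
```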