Articulatory Feature Prediction from Surface EMG during Speech Production

Jihwan Lee, Kevin Huang, Kleanthis Avramidis, Simon Pistrosch, Monica Gonzalez-Machorro, Yoonjeong Lee, Björn Schuller, Louis Goldstein, Shrikanth Narayanan

TL;DR

The paper addresses decoding speech from surface EMG by predicting articulatory features, which link muscle activity to articulatory movements and acoustic representations. It introduces a Transformer-augmented EMG encoder with separate EMA, pitch, loudness, and phoneme predictors, followed by articulatory synthesis to reconstruct intelligible speech. Reported results include high EMA and loudness prediction correlation ($r\approx0.9$), moderate pitch prediction correlation ($r\approx0.6$), PER/WER comparable to prior methods, and improved SpeechBERTScore. The paper also contributes a knowledge-driven electrode-configuration framework that enables effective electrode subset selection. The approach improves interpretability and practical deployment potential for EMG-based speech interfaces, and the authors state two limitations: reliance on acoustics-derived articulatory features and a single-speaker dataset.

Abstract

We present a model for predicting articulatory features from surface electromyography (EMG) signals during speech production. The proposed model integrates convolutional layers and a Transformer block, followed by separate predictors for articulatory features. Our approach achieves a high prediction correlation of approximately 0.9 for most articulatory features. Furthermore, we demonstrate that these predicted articulatory features can be decoded into intelligible speech waveforms. To our knowledge, this is the first method to decode speech waveforms from surface EMG via articulatory features, offering a novel approach to EMG-based speech synthesis. Additionally, we analyze the relationship between EMG electrode placement and articulatory feature predictability, providing knowledge-driven insights for optimizing EMG electrode configurations. The source code and decoded speech samples are publicly available.
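The "prediction correlation" quoted above is a Pearson correlation between a predicted feature trajectory and its target trajectory. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function name `pearson_r` and the toy trajectories are hypothetical and chosen only for illustration.

```python
import math


def pearson_r(pred, target):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel in r).
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_t)


# Toy trajectories (made-up numbers, NOT data from the paper).
target = [0.0, 1.0, 2.0, 3.0, 4.0]
pred = [0.1, 0.9, 2.2, 2.8, 4.1]
r = pearson_r(pred, target)
print(f"r = {r:.3f}")
```

In practice one such value would be computed per articulatory feature channel (for example each EMA coordinate, pitch, and loudness) and then summarized across test utterances; how the paper aggregates these values is not specified in this summary.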

Paper Structure

This paper contains 15 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Visualization of EMG electrode placement.
  • Figure 2: Six EMA sensor locations: Upper Lip (UL, red), Lower Lip (LL, yellow), Lower Incisor (LI, gray), Tongue Tip (TT, green), Tongue Body (TB, blue), and Tongue Dorsum (TD, purple).
  • Figure 3: Overall architecture of the proposed framework. The EMG encoder processes input EMG signals using convolutional blocks and a six-layer Transformer block. Its output is fed into four predictors (EMA, pitch, loudness, and an auxiliary phoneme predictor), each trained to estimate its corresponding target.
  • Figure 4: Pearson correlation between predicted articulatory features from EMG signals and the target articulatory features, estimated from audio recordings, along with 95% confidence intervals.
  • Figure 5: Correlation drop rate for each EMA sensor location when one EMG electrode is removed (left) and when only one EMG electrode is used (right). Warmer (yellow) colors indicate a stronger association between the corresponding EMG electrode and EMA sensor location. Note that the color bars are inverted between the two figure panels to enhance readability.
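
The electrode analysis in Figure 5 can be illustrated with a small sketch. The summary does not define "correlation drop rate", so the code below assumes a relative drop, (r_full - r_ablated) / r_full, where r_full is the correlation with all electrodes and r_ablated is the correlation after removing one electrode. The function name, electrode indices, and all correlation values are hypothetical.

```python
def correlation_drop_rate(r_full, r_ablated):
    """Relative drop in correlation after ablation (assumed definition)."""
    if r_full == 0:
        raise ValueError("r_full must be non-zero")
    return (r_full - r_ablated) / r_full


# Hypothetical correlations per EMA sensor with all electrodes in use.
r_full = {"UL": 0.90, "TT": 0.88}

# Hypothetical correlations after removing one electrode (keyed by index).
r_without = {
    1: {"UL": 0.72, "TT": 0.87},
    2: {"UL": 0.89, "TT": 0.66},
}

# Drop rate for every (removed electrode, EMA sensor) pair.
drop = {
    electrode: {
        sensor: correlation_drop_rate(r_full[sensor], r)
        for sensor, r in per_sensor.items()
    }
    for electrode, per_sensor in r_without.items()
}

# For each sensor, the electrode whose removal hurts prediction the most.
most_affecting = {
    sensor: max(drop, key=lambda e: drop[e][sensor]) for sensor in r_full
}
print(drop)
print(most_affecting)
```

Under this assumed definition, a larger drop rate in the leave-one-out setting suggests that the removed electrode carries information the remaining electrodes cannot replace for that sensor location.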