emg2speech: synthesizing speech from electromyography using self-supervised speech models

Harshavardhana T. Gowda, Lee M. Miller

TL;DR

The paper introduces emg2speech, a non-invasive EMG-to-speech system that uses self-supervised speech (SS) representations to bridge muscle activity and audio. The key insight is that SS features relate linearly to EMG power and that SS feature spaces encode articulatory structure. This enables an end-to-end pipeline that maps EMG signals to SS units and synthesizes speech with a pretrained vocoder, without explicit articulatory modeling. The authors build a large, open dataset of 9 hours with ~6,800 words and show that EMG covariance features encode articulatory information and that SS representations linearly predict EMG power with high fidelity ($r$ up to $0.85$), which motivates learning the mapping from EMG to SS features. The resulting framework achieves alignment-free, end-to-end EMG-to-audio generation and points toward few-shot/zero-shot adaptability and practical non-invasive speech prosthetics, with ongoing work targeting phoneme-guided decoding and perceptual metrics.

Abstract

We present a neuromuscular speech interface that translates electromyographic (EMG) signals collected from orofacial muscles during speech articulation directly into audio. We show that self-supervised speech (SS) representations exhibit a strong linear relationship with the electrical power of muscle action potentials: SS features can be linearly mapped to EMG power with a correlation of $r = 0.85$. Moreover, EMG power vectors corresponding to different articulatory gestures form structured and separable clusters in feature space. This relationship, $\text{SS features} \xrightarrow{\texttt{linear mapping}} \text{EMG power} \xrightarrow{\texttt{gesture-specific clustering}} \text{articulatory movements}$, highlights that SS models implicitly encode articulatory mechanisms. Leveraging this property, we directly map EMG signals to SS feature space and synthesize speech, enabling end-to-end EMG-to-speech generation without explicit articulatory models or vocoder training.
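
To make the "linear mapping with correlation $r$" claim concrete, here is a minimal sketch of how such a number could be computed. It is not the authors' code: the data are synthetic, the ridge-regularized least-squares fit is an assumed choice of linear model, and the names `fit_linear_map`, `predict`, and `mean_pearson_r` are hypothetical. `H` stands in for frame-level SS features and `P` for per-channel EMG power.

```python
import numpy as np

def fit_linear_map(H, P, ridge=1e-3):
    """Fit P ~ [H, 1] @ W by ridge-regularized least squares (assumed model)."""
    X = np.hstack([H, np.ones((H.shape[0], 1))])   # append a bias column
    A = X.T @ X + ridge * np.eye(X.shape[1])       # regularized normal equations
    return np.linalg.solve(A, X.T @ P)

def predict(H, W):
    """Apply the fitted linear map to new SS features."""
    X = np.hstack([H, np.ones((H.shape[0], 1))])
    return X @ W

def mean_pearson_r(P_true, P_pred):
    """Pearson correlation per output dimension, averaged over dimensions."""
    a = P_true - P_true.mean(axis=0)
    b = P_pred - P_pred.mean(axis=0)
    r = (a * b).sum(axis=0) / (np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0))
    return float(r.mean())

# Synthetic stand-ins (assumptions): 2000 frames, 64-dim SS features, 8 EMG channels.
rng = np.random.default_rng(0)
n_frames, ss_dim, n_channels = 2000, 64, 8
H = rng.standard_normal((n_frames, ss_dim))
true_W = rng.standard_normal((ss_dim, n_channels))
P = H @ true_W + 0.5 * rng.standard_normal((n_frames, n_channels))

# Fit on the first 1500 frames, report correlation on the held-out remainder.
split = 1500
W = fit_linear_map(H[:split], P[:split])
r = mean_pearson_r(P[split:], predict(H[split:], W))
```

Evaluating on held-out frames, as done here, keeps the reported correlation from being inflated by overfitting. Whether the paper averages $r$ over channels in this way is not stated in this excerpt.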

Paper Structure

This paper contains 12 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Left: Electrode placement on the left side of the neck. Middle: Electrode placement on the right side of the neck. Right: Electrode placement on the left cheek.
  • Figure 2: Different orofacial gestures are naturally separable. t-SNE visualization of vectors $\mathbb{D}(\mathcal{E})$ corresponding to 13 orofacial movements for a single subject. The embedding is color-coded by gesture type (a.u. = arbitrary units).
  • Figure 3: Layer-wise correlation ($r$) between $\mathbb{D}(\mathcal{E})$ and $\mathcal{H}$ across different self-supervised speech models. A simple linear mapping is used to predict $\mathbb{D}(\mathcal{E})$ from $\mathcal{H}$.
  • Figure 4: Layer-wise correlation ($r$) between $\mathcal{B}$ and $\mathcal{H}$ across different self-supervised speech models. A simple linear mapping is used to predict $\mathcal{B}$ from $\mathcal{H}$.
  • Figure 5: Multivariate EMG signals are converted into $\texttt{vec}(\mathcal{E})$, $\mathbb{D}(\mathcal{E})$, or $\mathcal{B}$, and then passed through a TDS Conv block to predict $\texttt{dis}(\mathcal{H})_{\textsc{HuBERT}}$, which are subsequently fed into a vocoder to synthesize audio. Frozen neural network components are shown in blue, and trainable components are shown in orange.
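
The feature step named in Figure 5 can be sketched as follows. This excerpt does not define the symbols, so the sketch assumes that $\mathcal{E}$ is the channel-by-channel covariance matrix of a short EMG window, that $\mathbb{D}(\mathcal{E})$ is its diagonal (per-channel power), and that $\texttt{vec}(\mathcal{E})$ is a vectorization of its upper triangle. $\mathcal{B}$ is omitted because nothing here indicates how it is computed. The function name, window length, and hop size are hypothetical.

```python
import numpy as np

def emg_window_features(emg, win, hop):
    """Windowed covariance features from multichannel EMG.

    emg: array of shape (n_samples, n_channels).
    Returns (vec_E, D_E):
      vec_E: (n_windows, C*(C+1)/2) upper-triangular covariance entries (assumed vec)
      D_E:   (n_windows, C) covariance diagonal, i.e. per-channel power (assumed D)
    """
    n_samples, n_channels = emg.shape
    iu = np.triu_indices(n_channels)        # row-major upper-triangle indices
    vec_E, D_E = [], []
    for start in range(0, n_samples - win + 1, hop):
        x = emg[start:start + win]
        x = x - x.mean(axis=0)              # remove per-channel offset
        E = x.T @ x / win                   # (C, C) covariance of this window
        vec_E.append(E[iu])
        D_E.append(np.diag(E))
    return np.array(vec_E), np.array(D_E)

# Synthetic stand-in signal (assumption): 1000 samples, 4 channels.
rng = np.random.default_rng(0)
emg = rng.standard_normal((1000, 4))
vec_E, D_E = emg_window_features(emg, win=100, hop=50)
```

In the pipeline described by the caption, a sequence of such per-window vectors would be the input to the trainable TDS Conv block, whose outputs are the discrete HuBERT units passed to the frozen vocoder.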