emg2speech: synthesizing speech from electromyography using self-supervised speech models
Harshavardhana T. Gowda, Lee M. Miller
TL;DR
The paper introduces emg2speech, a non-invasive EMG-to-speech system that uses self-supervised speech (SS) representations to bridge muscle activity and audio. The key insight is that SS features relate linearly to EMG power and that SS feature spaces encode articulatory structure. This enables an end-to-end pipeline that maps EMG signals to SS units and synthesizes speech with a pretrained vocoder, without explicit articulatory modeling. The authors build a large, open dataset of 9 hours with ~6,800 words and demonstrate that EMG covariance features encode articulatory information and are linearly related to SS representations with high fidelity ($r$ up to $0.85$), supporting efficient forward mappings. The resulting emg2speech framework achieves alignment-free, end-to-end EMG-to-audio generation and points toward few-shot/zero-shot adaptability and practical non-invasive speech prosthetics, with ongoing work targeting phoneme-guided decoding and perceptual metrics.
Abstract
We present a neuromuscular speech interface that translates electromyographic (EMG) signals collected from orofacial muscles during speech articulation directly into audio. We show that self-supervised speech (SS) representations exhibit a strong linear relationship with the electrical power of muscle action potentials: SS features can be linearly mapped to EMG power with a correlation of $r = 0.85$. Moreover, EMG power vectors corresponding to different articulatory gestures form structured and separable clusters in feature space. This chain of relationships, $\text{SS features} \xrightarrow{\texttt{linear mapping}} \text{EMG power} \xrightarrow{\texttt{gesture-specific clustering}} \text{articulatory movements}$, indicates that SS models implicitly encode articulatory mechanisms. Leveraging this property, we directly map EMG signals to the SS feature space and synthesize speech, enabling end-to-end EMG-to-speech generation without explicit articulatory models or vocoder training.
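The linear relationship between SS features and EMG power described above can be illustrated with a minimal sketch. It assumes synthetic data throughout: the array shapes, the ridge penalty, the window length, and the helper names `emg_power`, `fit_linear_map`, and `mean_pearson_r` are hypothetical and are not taken from the paper. EMG power is approximated here as the per-channel variance of a windowed signal, which is the diagonal of the channel covariance matrix.

```python
import numpy as np

def emg_power(windows):
    """Per-channel power of each EMG window (illustrative definition).

    windows: array of shape (n_frames, n_samples, n_channels).
    Returns the variance over samples, i.e. the diagonal of each
    window's channel covariance, with shape (n_frames, n_channels).
    """
    centered = windows - windows.mean(axis=1, keepdims=True)
    return (centered ** 2).mean(axis=1)

def fit_linear_map(ss_feats, power, ridge=1e-3):
    """Ridge least squares for power ~= ss_feats @ W + b."""
    mu_x, mu_y = ss_feats.mean(axis=0), power.mean(axis=0)
    X, Y = ss_feats - mu_x, power - mu_y
    W = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)
    b = mu_y - mu_x @ W
    return W, b

def mean_pearson_r(pred, target):
    """Pearson correlation per channel, averaged over channels."""
    p = pred - pred.mean(axis=0)
    t = target - target.mean(axis=0)
    r = (p * t).sum(axis=0) / (np.linalg.norm(p, axis=0) * np.linalg.norm(t, axis=0))
    return float(r.mean())

# Synthetic stand-ins (assumptions): random "SS features" and EMG windows
# whose per-channel variance is a positive linear function of those features.
rng = np.random.default_rng(0)
n_frames, ss_dim, n_samples, n_channels = 1000, 16, 400, 4
ss = rng.uniform(0.0, 1.0, size=(n_frames, ss_dim))
true_W = rng.uniform(0.0, 1.0, size=(ss_dim, n_channels))
true_power = 0.5 + ss @ true_W
noise = rng.standard_normal((n_frames, n_samples, n_channels))
windows = noise * np.sqrt(true_power)[:, None, :]

power = emg_power(windows)

# Fit on the first 750 frames, evaluate correlation on the held-out rest.
train, held_out = slice(0, 750), slice(750, None)
W, b = fit_linear_map(ss[train], power[train])
r = mean_pearson_r(ss[held_out] @ W + b, power[held_out])
print(f"held-out mean Pearson r: {r:.2f}")
```

The correlation printed here reflects only the synthetic setup and says nothing about the reported $r = 0.85$; with real recordings, `ss` would come from a pretrained SS model and `windows` from time-aligned EMG frames.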
