Confidence-Based Self-Training for EMG-to-Speech: Leveraging Synthetic EMG for Robust Modeling
Xiaodan Chen, Xiaoxue Gao, Mathias Quoy, Alexandre Pitti, Nancy F. Chen
TL;DR
This work tackles data scarcity in voiced EMG-to-Speech by introducing Confidence-based Multi-Speaker Self-training (CoM2S) and Libri-EMG, a time-aligned multi-speaker EMG dataset generated from LibriSpeech. By generating synthetic EMG conditioned on speaker-independent speech units and applying confidence-based phoneme filtering, the method selects high-quality pseudo-labeled data for self-training, achieving substantial gains in WER and phoneme accuracy. A 1:1 real-to-synthetic mix and a train-from-scratch regime demonstrate strong improvements, including cross-dataset generalization to Libri-EMG and enhanced subjective intelligibility and quality. These results suggest synthetic data, properly filtered, can robustly augment V-ETS models, reducing reliance on costly EMG collection and enabling more scalable, multi-speaker speech reconstruction from muscle activity; the authors also release Libri-EMG and code for future research.
Abstract
Voiced Electromyography (EMG)-to-Speech (V-ETS) models reconstruct speech from muscle activity signals, facilitating applications such as neurolaryngologic diagnostics. Despite this potential, the advancement of V-ETS is hindered by a scarcity of paired EMG-speech data. To address this, we propose a novel Confidence-based Multi-Speaker Self-training (CoM2S) approach, along with a newly curated Libri-EMG dataset. The approach generates synthetic EMG data with a pre-trained model, filters it with a proposed mechanism based on phoneme-level confidence, and uses the retained data to enhance the ETS model through self-training. Experiments demonstrate that our method improves phoneme accuracy, reduces phonological confusion, and lowers word error rate, confirming the effectiveness of CoM2S for V-ETS. To support future research, we will release the code and the proposed Libri-EMG dataset, an open-access, time-aligned, multi-speaker collection of voiced EMG and speech recordings.
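
The filtering and mixing steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SyntheticUtterance` record, the use of the mean phoneme confidence as the utterance score, the 0.8 threshold, and the helper names are all assumptions made for illustration. The abstract states only that filtering is based on phoneme-level confidence, and the TL;DR mentions a 1:1 real-to-synthetic mix.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SyntheticUtterance:
    """Hypothetical record for one synthetic EMG utterance."""
    utt_id: str
    # Assumed: one confidence value in [0, 1] per predicted phoneme.
    phoneme_confidences: List[float]


def utterance_confidence(utt: SyntheticUtterance) -> float:
    """Mean phoneme-level confidence (an assumed aggregate); 0.0 if empty."""
    if not utt.phoneme_confidences:
        return 0.0
    return sum(utt.phoneme_confidences) / len(utt.phoneme_confidences)


def filter_by_confidence(
    utts: Sequence[SyntheticUtterance], threshold: float = 0.8
) -> List[SyntheticUtterance]:
    """Keep utterances whose aggregate confidence reaches the threshold.

    The threshold value is illustrative, not taken from the paper.
    """
    return [u for u in utts if utterance_confidence(u) >= threshold]


def mix_one_to_one(
    real_ids: Sequence[str],
    synthetic: Sequence[SyntheticUtterance],
    seed: int = 0,
) -> List[str]:
    """Build a training list with at most one synthetic item per real item.

    If fewer synthetic utterances survive filtering than there are real
    ones, all survivors are used and the ratio falls below 1:1.
    """
    rng = random.Random(seed)
    n = min(len(real_ids), len(synthetic))
    chosen = rng.sample(list(synthetic), n)
    return list(real_ids) + [u.utt_id for u in chosen]
```

In a self-training loop, the retained synthetic utterances would be combined with the real paired data before the ETS model is retrained; how the confidence values themselves are obtained is left outside this sketch.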
