Confidence-Based Self-Training for EMG-to-Speech: Leveraging Synthetic EMG for Robust Modeling

Xiaodan Chen, Xiaoxue Gao, Mathias Quoy, Alexandre Pitti, Nancy F. Chen

TL;DR

This work tackles data scarcity in voiced EMG-to-Speech by introducing Confidence-based Multi-Speaker Self-training (CoM2S) and Libri-EMG, a time-aligned multi-speaker EMG dataset generated from LibriSpeech. By generating synthetic EMG conditioned on speaker-independent speech units and applying confidence-based phoneme filtering, the method selects high-quality pseudo-labeled data for self-training, achieving substantial gains in WER and phoneme accuracy. A 1:1 real-to-synthetic mix and a train-from-scratch regime demonstrate strong improvements, including cross-dataset generalization to Libri-EMG and enhanced subjective intelligibility and quality. These results suggest synthetic data, properly filtered, can robustly augment V-ETS models, reducing reliance on costly EMG collection and enabling more scalable, multi-speaker speech reconstruction from muscle activity; the authors also release Libri-EMG and code for future research.

Abstract

Voiced Electromyography (EMG)-to-Speech (V-ETS) models reconstruct speech from muscle activity signals, facilitating applications such as neurolaryngologic diagnostics. Despite this potential, the advancement of V-ETS is hindered by a scarcity of paired EMG-speech data. To address this, we propose a novel Confidence-based Multi-Speaker Self-training (CoM2S) approach, along with a newly curated Libri-EMG dataset. The approach generates synthetic EMG data with a pre-trained model, filters it with a proposed phoneme-level confidence mechanism, and uses the retained data to enhance the ETS model through self-training. Experiments demonstrate that our method improves phoneme accuracy, reduces phonological confusion, and lowers word error rate, confirming the effectiveness of CoM2S for V-ETS. In support of future research, we will release the code and the proposed Libri-EMG dataset, an open-access, time-aligned, multi-speaker collection of voiced EMG and speech recordings.

Paper Structure

This paper contains 21 sections, 3 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Top left: Overview of our CoM2S approach for V-ETS. We employ a GAN-based EMG generator [scheck23_interspeech] conditioned on speaker-independent Soft Speech Units generated from the HuBERT Encoder [van_Niekerk_2022], with learnable session embeddings accounting for variations in electrode configurations. The generated EMG data then undergoes preprocessing, including upsampling and inverse transformation, to align with real EMG signals as described in Sec. \ref{generation_libriEMG}. A pretrained transduction model together with the pretrained classifier [gaddy2021improvedmodelvoicingsilent] serves as the teacher model, filtering synthetic samples based on phoneme accuracy. Only high-confidence synthetic data is retained and proportionally mixed with real EMG data for self-training, ensuring robust adaptation while maintaining phonetic consistency. Top right: baseline transduction model architecture [gaddy-klein-2020-digital, gaddy2021improvedmodelvoicingsilent]. Bottom: inference pipeline.
  • Figure 2: Performance comparison of EMG-based speech recognition models trained on different filtered subsets of self-generated data (5.4h dev-clean in LibriSpeech [Librispeech]) and evaluated on corresponding test sets. Values and colors represent word error rates (WER) (lower/lighter is better).
  • Figure 3: The evaluation results of WER, phoneme confusion and phoneme accuracy across different real-to-synthetic data ratios.
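
The filter-then-mix step described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `SyntheticSample` container, the function names, the position-wise accuracy measure, and the threshold value used below are all assumptions made for illustration. The sketch takes the teacher's phoneme predictions as given, keeps synthetic utterances whose phoneme accuracy meets a threshold, and mixes them with real data at a chosen ratio (1:1 by default, the ratio highlighted in the TL;DR).

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class SyntheticSample:
    """Hypothetical container for one synthetic EMG utterance."""
    utt_id: str
    ref_phonemes: List[str]   # reference phoneme per frame (assumed available)
    pred_phonemes: List[str]  # teacher prediction per frame on synthetic EMG


def phoneme_accuracy(sample: SyntheticSample) -> float:
    """Fraction of frames where the teacher prediction matches the reference.

    Returns 0.0 for empty or length-mismatched sequences (an assumption:
    such samples are treated as unreliable and will be filtered out).
    """
    ref, pred = sample.ref_phonemes, sample.pred_phonemes
    if not ref or len(ref) != len(pred):
        return 0.0
    hits = sum(r == p for r, p in zip(ref, pred))
    return hits / len(ref)


def filter_by_confidence(samples, threshold):
    """Keep only samples whose phoneme accuracy is at least `threshold`."""
    return [s for s in samples if phoneme_accuracy(s) >= threshold]


def mix_real_and_synthetic(real, synthetic, synth_per_real=1.0, seed=0):
    """Combine real items with a random subset of synthetic items.

    `synth_per_real=1.0` gives a 1:1 real-to-synthetic mix; if fewer
    synthetic items are available than requested, all of them are used.
    """
    rng = random.Random(seed)
    n_synth = min(len(synthetic), round(len(real) * synth_per_real))
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)
    return mixed
```

A typical call would be `mix_real_and_synthetic(real_set, filter_by_confidence(synth_set, threshold=0.8))`, where `0.8` is a placeholder rather than a value taken from the paper. Fixing the seed keeps the sampled subset reproducible across runs, which makes comparisons across real-to-synthetic ratios (as in Figure 3) easier to interpret.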