Deep Speech Synthesis from Multimodal Articulatory Representations

Peter Wu, Bohan Yu, Kevin Scheck, Alan W Black, Aditi S. Krishnapriyan, Irene Y. Chen, Tanja Schultz, Shinji Watanabe, Gopala K. Anumanchipalli

TL;DR

This work demonstrates a wearable dry-electrode neckband for EMG-based speech decoding, addressing comfort and practicality over traditional facial/wet-electrode setups. Through multi-channel neck data (and facial benchmarks), it achieves high word-classification accuracy and provides phonological analyses confirming robust vowel decoding with neck data. The study also shows a substantive, but partial, linkage between self-supervised acoustic representations and EMG signals, suggesting exploitable cross-modal information for speech synthesis. Overall, the neckband approach offers a promising path toward practical, wearable EMG-to-speech systems suitable for assistive communication, with potential for multimodal integration and future sentence-length tasks.

Abstract

Far less articulatory data than acoustic speech data is available for training deep learning models. To improve articulatory-to-acoustic synthesis performance in these low-resource settings, we propose a multimodal pre-training framework. On single-speaker speech synthesis tasks from real-time magnetic resonance imaging and surface electromyography inputs, the intelligibility of synthesized outputs improves noticeably. For example, compared to prior work, our proposed transfer learning methods improve MRI-to-speech performance by 36% word error rate. In addition to these intelligibility results, our multimodal pre-trained models consistently outperform unimodal baselines on three objective and subjective synthesis quality metrics.
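
The intelligibility figure in the abstract is stated in terms of word error rate (WER). As a point of reference, the following is a minimal sketch of the standard WER definition: the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not describe the authors' evaluation pipeline, nor whether the 36% figure is an absolute or a relative change.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (not from the paper). Words are obtained by
    whitespace splitting, which is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.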

Paper Structure

This paper contains 14 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: (a) Experimental setup including a dry-electrode neckband, baseline monitoring face electrodes, a wet reference electrode behind the right ear, and neck-worn electronics behind the head. (b) Partial photograph of the 3D-printed, gold-plated neck electrodes. (c) Sample renders of the experiment GUI's subject and host views. The subject view displays a teleprompter, while raw EMG data is plotted live on the host view. (d) Raw sample EMG from a single utterance of the words 'Heed' and 'Kale'. (e) Sample EMG time-frequency spectrograms (see section 3.2) from a single utterance of the words 'Heed' and 'Kale'.
  • Figure 2: Classification accuracy for different numbers of neck electrodes. Solid lines are means and opaque regions are 95% confidence intervals.
  • Figure 3: Confusion matrices using model trained on (a) the 10 neck channels and (b) all 13 channels.
  • Figure 4: Weighted sums of self-supervised speech features match EMG spectrogram frequency bins. Here, we plot one EMG channel of a "kale" utterance for bins 90-94 Hz, 102-105 Hz, 238-242 Hz, and 348-352 Hz (top to bottom).
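
The Figure 2 caption describes mean accuracy with 95% confidence intervals. The caption does not say how those intervals were computed, so the following is only a sketch of one common approach: a normal-approximation interval around the sample mean. The function name `mean_with_ci95` is hypothetical, and the input is assumed to be a list of accuracies from repeated runs.

```python
import math
import statistics


def mean_with_ci95(samples):
    """Return (mean, lower, upper) using a normal-approximation 95% interval.

    Illustrative only: the source does not specify the interval method.
    With few samples, a Student-t critical value would give a wider interval.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate spread")
    mean = statistics.fmean(samples)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(n)
    half_width = 1.96 * sem  # 1.96 is the two-sided 95% normal quantile
    return mean, mean - half_width, mean + half_width
```

Calling the function once per electrode count would give the center line and band edges for a plot of that kind.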