Deep Speech Synthesis from Multimodal Articulatory Representations

Peter Wu; Bohan Yu; Kevin Scheck; Alan W Black; Aditi S. Krishnapriyan; Irene Y. Chen; Tanja Schultz; Shinji Watanabe; Gopala K. Anumanchipalli

TL;DR

This work demonstrates a wearable dry-electrode neckband for EMG-based speech decoding, addressing comfort and practicality over traditional facial/wet-electrode setups. Through multi-channel neck data (and facial benchmarks), it achieves high word-classification accuracy and provides phonological analyses confirming robust vowel decoding with neck data. The study also shows a substantive, but partial, linkage between self-supervised acoustic representations and EMG signals, suggesting exploitable cross-modal information for speech synthesis. Overall, the neckband approach offers a promising path toward practical, wearable EMG-to-speech systems suitable for assistive communication, with potential for multimodal integration and future sentence-length tasks.

Abstract

Far less articulatory data than acoustic speech data is available for training deep learning models. To improve articulatory-to-acoustic synthesis performance in these low-resource settings, we propose a multimodal pre-training framework. On single-speaker speech synthesis tasks from real-time magnetic resonance imaging (MRI) and surface electromyography (EMG) inputs, the intelligibility of synthesized outputs improves noticeably. For example, compared to prior work, our proposed transfer learning methods improve MRI-to-speech performance by 36% in word error rate. In addition to these intelligibility results, our multimodal pre-trained models consistently outperform unimodal baselines on three objective and subjective synthesis quality metrics.
