Table of Contents
Fetching ...

A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition

Tyler Benster, Guy Wilson, Reshef Elisha, Francis R Willett, Shaul Druckmann

TL;DR

This work represents the first instance where noninvasive silent speech recognition on an open vocabulary has cleared the threshold of 15% WER, demonstrating that SSIs can be a viable alternative to automatic speech recognition (ASR), and opens new possibilities in human-computer interaction.

Abstract

Silent Speech Interfaces (SSIs) offer a noninvasive alternative to brain-computer interfaces for soundless verbal communication. We introduce Multimodal Orofacial Neural Audio (MONA), a system that leverages cross-modal alignment through novel loss functions--cross-contrast (crossCon) and supervised temporal contrast (supTcon)--to train a multimodal model with a shared latent representation. This architecture enables the use of audio-only datasets like LibriSpeech to improve silent speech recognition. Additionally, our introduction of Large Language Model (LLM) Integrated Scoring Adjustment (LISA) significantly improves recognition accuracy. Together, MONA LISA reduces the state-of-the-art word error rate (WER) from 28.8% to 12.2% in the Gaddy (2020) benchmark dataset for silent speech on an open vocabulary. For vocal EMG recordings, our method improves the state-of-the-art from 23.3% to 3.7% WER. In the Brain-to-Text 2024 competition, LISA performs best, improving the top WER from 9.8% to 8.9%. To the best of our knowledge, this work represents the first instance where noninvasive silent speech recognition on an open vocabulary has cleared the threshold of 15% WER, demonstrating that SSIs can be a viable alternative to automatic speech recognition (ASR). Our work not only narrows the performance gap between silent and vocalized speech but also opens new possibilities in human-computer interaction, demonstrating the potential of cross-modal approaches in noisy and data-limited regimes.

A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition

TL;DR

This work represents the first instance where noninvasive silent speech recognition on an open vocabulary has cleared the threshold of 15% WER, demonstrating that SSIs can be a viable alternative to automatic speech recognition (ASR), and opens new possibilities in human-computer interaction.

Abstract

Silent Speech Interfaces (SSIs) offer a noninvasive alternative to brain-computer interfaces for soundless verbal communication. We introduce Multimodal Orofacial Neural Audio (MONA), a system that leverages cross-modal alignment through novel loss functions--cross-contrast (crossCon) and supervised temporal contrast (supTcon)--to train a multimodal model with a shared latent representation. This architecture enables the use of audio-only datasets like LibriSpeech to improve silent speech recognition. Additionally, our introduction of Large Language Model (LLM) Integrated Scoring Adjustment (LISA) significantly improves recognition accuracy. Together, MONA LISA reduces the state-of-the-art word error rate (WER) from 28.8% to 12.2% in the Gaddy (2020) benchmark dataset for silent speech on an open vocabulary. For vocal EMG recordings, our method improves the state-of-the-art from 23.3% to 3.7% WER. In the Brain-to-Text 2024 competition, LISA performs best, improving the top WER from 9.8% to 8.9%. To the best of our knowledge, this work represents the first instance where noninvasive silent speech recognition on an open vocabulary has cleared the threshold of 15% WER, demonstrating that SSIs can be a viable alternative to automatic speech recognition (ASR). Our work not only narrows the performance gap between silent and vocalized speech but also opens new possibilities in human-computer interaction, demonstrating the potential of cross-modal approaches in noisy and data-limited regimes.
Paper Structure (21 sections, 6 equations, 5 figures, 7 tables, 1 algorithm)

This paper contains 21 sections, 6 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: MONA architecture.
  • Figure 2: Latent space alignment for time-aligned data with crossCon. The similarity of latent embeddings for simultaneously recorded EMG and audio data is maximized for the same timestep, and minimized to other timesteps.
  • Figure 3: Latent space alignment for independent data with supTcon. The similarity of latent embeddings with the same phoneme class /i/ is maximized, and minimized for latent embeddings with different phonemes.
  • Figure 4: Comparison of validation WER for the MONA architecture with varying loss function. -Libri = No LibriSpeech, bal = Balanced audio/EMG sampling.
  • Figure 5: Median of five runs WER and CTC loss by training epoch on validation data using 150 beams.