Advancing NAM-to-Speech Conversion with Novel Methods and the MultiNAM Dataset

Neil Shah, Shirish Karande, Vineet Gandhi

TL;DR

This work tackles NAM-to-speech conversion by addressing two weaknesses of existing methods: their reliance on paired whispers and the limited generalization of ground-truth synthesis. It introduces a two-step framework: first, simulate ground-truth speech from NAMs using whisper-based, forced-alignment, or vision-based methods; then, train a Seq2Seq model to map NAMs to speech. To further reduce dependence on audio data, it proposes a diffusion-based lip-to-speech approach (Diff-NAM) conditioned on video and simulated NAMs. A new MultiNAM dataset with 7.96 hours of paired NAM, whisper, video, and text data from two speakers is released to benchmark methods across modalities. The results show that Diff-NAM yields the lowest ground-truth and converted-speech error rates among lip-to-speech baselines, highlighting the value of content-specific diffusion conditioning. Overall, the work advances robust NAM-to-speech conversion in both high-resource and resource-scarce regimes and broadens the modalities available for ground-truth simulation and synthesis.

Abstract

Current Non-Audible Murmur (NAM)-to-speech techniques rely on voice cloning to simulate ground-truth speech from paired whispers. However, the simulated speech often lacks intelligibility and fails to generalize well across different speakers. To address this issue, we focus on learning phoneme-level alignments from paired whispers and text and employ a Text-to-Speech (TTS) system to simulate the ground truth. To reduce dependence on whispers, we learn phoneme alignments directly from NAMs, though the quality is constrained by the available training data. To further mitigate reliance on NAM/whisper data for ground-truth simulation, we propose incorporating the lip modality to infer speech and introduce a novel diffusion-based method that leverages recent advancements in lip-to-speech technology. Additionally, we release the MultiNAM dataset with over 7.96 hours of paired NAM, whisper, video, and text data from two speakers and benchmark all methods on this dataset. Speech samples and the dataset are available at https://diff-nam.github.io/DiffNAM/

Paper Structure

This paper contains 11 sections, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: Data recording setup: A laptop displays text while recording the speaker’s face and whispering voice. A stethoscope head placed behind the ear captures NAM vibrations.