Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models

Neil Shah; Shirish Karande; Vineet Gandhi

TL;DR

The paper aims to improve the intelligibility of NAM-to-speech conversion without relying on studio-recorded ground-truth speech, using self-supervised learning (SSL) instead. It introduces ground-truth speech simulation from whispered speech, data augmentation to expand the NAM dataset, and a non-autoregressive Seq2Seq model trained with MSE and CTC losses, guided by HuBERT embeddings and paired with a HiFiGAN vocoder. Using simulated ground-truth speech, the approach achieves a 29.08% relative reduction in Mel-Cepstral Distortion (MCD) over the current SOTA. Data augmentation further improves intelligibility, as measured by WER and CER, and the model can synthesize speech in novel voices. The work shows that SSL-based NAM-to-speech pipelines are viable without studio data, with practical implications for silent communication and personalized voice synthesis.

Abstract

We propose a novel approach to significantly improve intelligibility in the Non-Audible Murmur (NAM)-to-speech conversion task, leveraging self-supervision and sequence-to-sequence (Seq2Seq) learning techniques. Unlike conventional methods that explicitly record ground-truth speech, our methodology relies on self-supervision and speech-to-speech synthesis to simulate ground-truth speech. Despite using simulated speech, our method surpasses the current state-of-the-art (SOTA) with a 29.08% improvement in the Mel-Cepstral Distortion (MCD) metric. Additionally, we report error rates and demonstrate our model's ability to synthesize speech in novel voices of interest. Moreover, we present a methodology for augmenting the existing CSTR NAM TIMIT Plus corpus, setting a benchmark Word Error Rate (WER) of 42.57% for gauging the intelligibility of the synthesized speech. Speech samples can be found at https://nam2speech.github.io/NAM2Speech/
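
The WER and CER figures quoted above are edit-distance metrics. As a rough illustration of how such error rates are conventionally computed, here is a minimal Python sketch. It is not the authors' evaluation script: the function names (`edit_distance`, `wer`, `cer`), the lowercasing, and the whitespace tokenization are assumptions made for this example, and in practice the hypothesis text would come from a speech recognizer run on the synthesized audio.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.lower().split()  # assumed normalization
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: same idea at character level (spaces kept)."""
    ref_chars = list(reference.lower())
    hyp_chars = list(hypothesis.lower())
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One deleted word and one substituted word against a 5-word reference.
    print(wer("It is a terrible loss", "it is terrible lost"))
```

Under these assumptions, a hypothesis that drops one word and substitutes another against a five-word reference yields a WER of 0.4. Text normalization choices (casing, punctuation, whether spaces count toward CER) change the numbers, so reported rates are only comparable under the same conventions.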

Paper Structure (15 sections, 3 equations, 2 figures, 2 tables)

Introduction
Related work
Method
Speech encoder
Ground-truth speech simulation
Data augmentation
Time alignment of representations
Seq2Seq network
Speech vocoder
Dataset
Results and discussion
Recognition performance with no data augmentation
Recognition performance with data augmentation
Qualitative evaluation
Conclusion

Figures (2)

Figure 1: Proposed methodology overview: (A) Ground-truth speech simulation from whisper speech, (B) Data augmentation with LJSpeech and the DTW algorithm to generate time-aligned LJNAM samples in a NAM-like speaking voice, (C) Seq2Seq learning framework, and (D) Inference pipeline for voice synthesis in the NAM-to-speech conversion task. Green boxes denote pre-trained or frozen components, while grey boxes signify training modules.
Figure 2: Mel-spectrogram comparison of (A) original NAM signal and synthesized speech from (B) DiscoGAN, (C) MSpec-Net, and (D) our proposed method. ID: 401, Text: "It is a terrible loss". The white dotted box showcases our method's superior ability to preserve and accurately estimate formants compared to MSpec-Net.
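
Figure 1(B) refers to a DTW (dynamic time warping) step used to time-align samples. The sketch below shows textbook DTW between two sequences of feature frames, as a rough illustration of what such an alignment computes. The features, distance measure, and step pattern used in the paper are not stated in this summary, so the Euclidean frame distance, the three-way step pattern, and the name `dtw_align` are assumptions for this example only.

```python
import math


def dtw_align(source, target):
    """Align two sequences of equal-dimension frames with classic DTW.

    Returns (path, total_cost), where path is a list of (i, j) index pairs
    mapping source frame i to target frame j.
    """
    n, m = len(source), len(target)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    # acc[i][j] = minimal accumulated cost of aligning source[:i] with target[:j]
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(source[i - 1], target[j - 1])  # assumed Euclidean
            acc[i][j] = cost + min(acc[i - 1][j - 1],  # advance both
                                   acc[i - 1][j],      # advance source only
                                   acc[i][j - 1])      # advance target only

    # Backtrack from the end to recover the warping path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda c: acc[c[0]][c[1]])
        path.append((i - 1, j - 1))
    path.reverse()
    return path, acc[n][m]


if __name__ == "__main__":
    # Toy 1-D "frames": the target repeats the middle frame once.
    src = [[0.0], [1.0], [2.0]]
    tgt = [[0.0], [1.0], [1.0], [2.0]]
    print(dtw_align(src, tgt))
```

In the toy example the middle source frame is matched to both repeated target frames, which is the stretching behavior that makes DTW useful for pairing utterances of different durations. The quadratic loop is fine for short sequences; longer ones would typically call for a windowed or library implementation.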
