Variational Autoencoder for Personalized Pathological Speech Enhancement

Mingchi Hou, Ina Kodrasi

TL;DR

This work analyzes the generalizability of a hybrid VAE-NMF speech enhancement (SE) framework across languages and speaker conditions, with an emphasis on pathological speech associated with Parkinson's disease (PD). The framework uses a Monte Carlo expectation-maximization (MCEM) algorithm to estimate the variances and Wiener gains, with a clean-speech prior learned by a VAE and a separate NMF noise model. The authors show that models trained on neurotypical data generalize poorly to pathological speech. Fine-tuning on mixed data helps, and per-speaker personalization using only a few seconds of clean data helps more: it substantially improves performance for both groups and yields comparable results for neurotypical and pathological speakers. These findings highlight the practical potential of personalized SE for PD populations, where robust enhancement is possible with minimal per-user data.
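To make the roles of the two variance models concrete, here is a minimal sketch of the Wiener filtering step, not the authors' implementation. It assumes the speech variance is already available as a nonnegative array (for example, the output of a VAE decoder) and that the noise variance is the product of two nonnegative NMF matrices. The function names, the array shapes, and the `eps` stabilizer are illustrative assumptions, and the MCEM sampling over the latent variables is omitted.

```python
import numpy as np

def wiener_gain(speech_var, noise_W, noise_H, eps=1e-10):
    """Per-bin Wiener gain from a speech variance and an NMF noise variance.

    speech_var: (F, T) nonnegative speech variance (assumed given, e.g. by a VAE decoder).
    noise_W:    (F, K) nonnegative NMF dictionary (hypothetical name).
    noise_H:    (K, T) nonnegative NMF activations (hypothetical name).
    """
    noise_var = noise_W @ noise_H  # (F, K) @ (K, T) -> (F, T)
    # eps only guards against division by zero when both variances vanish.
    return speech_var / (speech_var + noise_var + eps)

def enhance(noisy_stft, speech_var, noise_W, noise_H):
    """Apply the real-valued gain to the complex noisy STFT coefficients."""
    return wiener_gain(speech_var, noise_W, noise_H) * noisy_stft

# Usage with random placeholder data (not real speech).
rng = np.random.default_rng(0)
F, T, K = 5, 4, 2
speech_var = rng.random((F, T)) + 0.1
W, H = rng.random((F, K)), rng.random((K, T))
noisy = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
estimate = enhance(noisy, speech_var, W, H)
```

The gain lies between 0 and 1 in every time-frequency bin, so the filter can only attenuate: bins where the noise model explains most of the energy are suppressed, and bins dominated by the speech variance pass through nearly unchanged.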

Abstract

The generalizability of speech enhancement (SE) models across speaker conditions remains largely unexplored, despite its critical importance for broader applicability. This paper investigates the performance of the hybrid variational autoencoder (VAE)-non-negative matrix factorization (NMF) model for SE, focusing primarily on its generalizability to pathological speakers with Parkinson's disease. We show that VAE models trained on large neurotypical datasets perform poorly on pathological speech. While fine-tuning these pre-trained models with pathological speech improves performance, a performance gap remains between neurotypical and pathological speakers. To address this gap, we propose using personalized SE models derived from fine-tuning pre-trained models with only a few seconds of clean data from each speaker. Our results demonstrate that personalized models considerably enhance performance for all speakers, achieving comparable results for both neurotypical and pathological speakers.

Paper Structure

This paper contains 13 sections, 5 equations, 4 tables.