Investigation of Speech and Noise Latent Representations in Single-channel VAE-based Speech Enhancement
Jiatong Li, Simon Doclo
TL;DR
This work analyzes how the latent representations learned by the pretrained speech and noise VAEs influence a PVAE-based single-channel speech enhancement system. By varying the DIP-VAE loss terms used in pretraining, especially the KL regularization, the authors show that disentangled latent spaces that clearly separate speech and noise lead to improved SI-SNR and PESQ on DNS3 and on mismatched datasets. The key finding is that omitting the KL term during pretraining (beta=0) yields the largest improvements, linking latent separation to better speech reconstruction. Visualizations corroborate that reduced KL pressure promotes more distinct speech and noise latents, suggesting that latent-space structure is critical for PVAE performance. This points to disentangled representations as a direction for future work, including potential extensions to complex-valued VAEs.
Abstract
Recently, a variational autoencoder (VAE)-based single-channel speech enhancement system using Bayesian permutation training has been proposed, which uses two pretrained VAEs to obtain latent representations for speech and noise. Based on these pretrained VAEs, a noisy VAE learns to generate speech and noise latent representations from noisy speech for speech enhancement. Modifying the pretrained VAE loss terms affects the pretrained speech and noise latent representations. In this paper, we investigate how these different representations affect speech enhancement performance. Experiments on the DNS3, WSJ0-QUT, and VoiceBank-DEMAND datasets show that a latent space where speech and noise representations are clearly separated significantly improves performance over standard VAEs, which produce overlapping speech and noise representations.
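To make concrete what "modifying the pretrained VAE loss terms" can mean, the following is a minimal NumPy sketch of a VAE pretraining loss with a weight beta on the KL term. It is an illustration under stated assumptions, not the paper's implementation: the squared-error reconstruction term, the diagonal-Gaussian posterior with a standard normal prior, and the names `gaussian_kl` and `pretraining_loss` are all chosen here for illustration, and any additional DIP-VAE regularizer is omitted.

```python
import numpy as np


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), one value per sample.

    Assumes a diagonal-Gaussian posterior and a standard normal prior.
    """
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)


def pretraining_loss(x, x_hat, mu, logvar, beta=1.0):
    """Batch-averaged reconstruction error plus beta-weighted KL term.

    Assumption: squared error stands in for the reconstruction term;
    the paper's actual likelihood model may differ.
    """
    recon = np.sum((x - x_hat) ** 2, axis=-1)  # per-sample reconstruction error
    kl = gaussian_kl(mu, logvar)               # per-sample KL regularizer
    return float(np.mean(recon + beta * kl))


# Hypothetical toy batch: 2 samples, 3 features, 2 latent dimensions.
x = np.ones((2, 3))
x_hat = np.zeros((2, 3))
mu = np.array([[1.0, 0.0], [0.0, 2.0]])
logvar = np.zeros((2, 2))

loss_standard = pretraining_loss(x, x_hat, mu, logvar, beta=1.0)
loss_no_kl = pretraining_loss(x, x_hat, mu, logvar, beta=0.0)
```

With beta=0 the loss no longer pulls the latent means toward the origin, so in this sketch the encoder is free to place different classes of input in distinct regions of the latent space, which is the kind of separation the abstract associates with better enhancement performance.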
