Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation

Jisoo Park, Seonghak Lee, Guisik Kim, Taewoo Kim, Junseok Kwon

TL;DR

Real-world speech often involves both background noise and overlapping speakers, motivating a unified SE/SS solution. The authors propose UniVoiceLite, a lightweight unsupervised audio-visual Wasserstein autoencoder that uses lip motion and facial identity as visual priors and replaces KL regularization with Wasserstein distance to stabilize the latent space. The model jointly performs speech enhancement and separation without paired noisy-clean data and demonstrates strong SDR/STOI performance and generalization across noisy and multi-speaker scenarios with only 2.3M parameters, aided by an MCEM-based Wiener post-processing step. The work delivers an efficient, scalable approach for robust speech processing in realistic environments. Two quantities are central to the latent-space design and its regularization: $\boldsymbol{z}_n$ and the Wasserstein distance $\mathcal{W}$.

Abstract

Speech Enhancement (SE) and Speech Separation (SS) have traditionally been treated as distinct tasks in speech processing. However, real-world audio often involves both background noise and overlapping speakers, motivating the need for a unified solution. While recent approaches have attempted to integrate SE and SS within multi-stage architectures, these approaches typically involve complex, parameter-heavy models and rely on supervised training, limiting scalability and generalization. In this work, we propose UniVoiceLite, a lightweight and unsupervised audio-visual framework that unifies SE and SS within a single model. UniVoiceLite leverages lip motion and facial identity cues to guide speech extraction and employs Wasserstein distance regularization to stabilize the latent space without requiring paired noisy-clean data. Experimental results demonstrate that UniVoiceLite achieves strong performance in both noisy and multi-speaker scenarios, combining efficiency with robust generalization. The source code is available at https://github.com/jisoo-o/UniVoiceLite.
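
To illustrate why swapping KL regularization for a Wasserstein distance can stabilize a latent space, here is a minimal sketch. It is not the authors' implementation: the exact form of their regularizer is not given in this summary, so the sketch assumes diagonal Gaussian posteriors and a standard normal prior, for which both quantities have well-known closed forms. The function names are hypothetical.

```python
import math

def kl_to_standard_normal(mean, std):
    """KL( N(mean, diag(std^2)) || N(0, I) ), the usual VAE regularizer."""
    return 0.5 * sum(
        s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mean, std)
    )

def w2_squared_to_standard_normal(mean, std):
    """Squared 2-Wasserstein distance between N(mean, diag(std^2)) and N(0, I).

    For diagonal Gaussians this reduces to the squared distance between
    the means plus the squared distance between the standard deviations.
    """
    return sum(m * m + (s - 1.0) ** 2 for m, s in zip(mean, std))

# Assumed toy posterior mean; shrink the posterior std towards zero.
mean = [0.5, -0.2]
for std in ([1.0, 1.0], [0.1, 0.1], [1e-4, 1e-4]):
    kl = kl_to_standard_normal(mean, std)
    w2 = w2_squared_to_standard_normal(mean, std)
    print(f"std={std[0]:g}  KL={kl:.3f}  W2^2={w2:.3f}")
```

As the posterior standard deviation shrinks, the KL term grows without bound because of its logarithm, whereas the Wasserstein term stays bounded. Under these assumptions, that is one plausible reading of the stabilizing effect the abstract describes.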

Paper Structure

This paper contains 9 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Advantages of the Proposed WAE for Integrated Speech Enhancement and Separation Tasks. (a) Traditional VAE exhibits significant overlap in the latent space due to many-to-one mapping, leading to blurry and less precise reconstructions. (b) The proposed WAE reduces this overlap and produces reconstructions that more accurately resemble the clean input (highlighted with blue dashed boxes), making it well-suited for integrated speech enhancement and separation tasks.
  • Figure 2: Illustration of the Proposed UniVoiceLite Pipeline. During training, the model takes visual features $v_n$, including lip motion and facial attributes, along with clean speech features ${s}_n$ as inputs to the encoder, which maps them to a latent space. The decoder then reconstructs the speech signal using the visual prior $v_n$. During inference, the model processes noisy audio and speaker-specific visual information from multiple speakers. UniVoiceLite then separates the speech components, generating enhanced speech for each speaker while effectively filtering out background noise.
  • Figure 3: Speech Enhancement (SE) Comparison Using Mel-Spectrogram. Top: Spectrograms of ground truth, ours, RVAE, and AV-VAE. Bottom: Zoomed-in highlighted regions.
  • Figure 4: Speech Separation (SS) Comparison in a 3-Speaker Scenario Using SDR and STOI Metrics.
  • Figure 5: Ablation Study (Mel-Spectrogram Comparison). Removing $V$ (visual features) leads to noticeable distortions, while the absence of $W$ (Wasserstein distance) results in blurred and less structured reconstructions.
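
The figures above report SDR among their metrics. As a rough illustration of what that number measures, the sketch below computes a plain signal-to-distortion ratio in decibels between a reference and an estimated waveform. This is an assumption-laden simplification: evaluation toolkits often use more elaborate SDR definitions, and this summary does not state which variant the paper uses. The function name and the toy signals are hypothetical.

```python
import math

def sdr_db(reference, estimate):
    """Plain signal-to-distortion ratio in dB: 10 * log10(|s|^2 / |s - s_hat|^2)."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    if signal_energy == 0.0:
        raise ValueError("reference must have non-zero energy")
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)

# Toy example: the estimate is the reference attenuated by 10 percent.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
print(f"SDR = {sdr_db(reference, estimate):.2f} dB")
```

Higher values mean the estimate is closer to the clean reference. STOI, the other reported metric, targets intelligibility and is not sketched here.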