Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation

Jisoo Park; Seonghak Lee; Guisik Kim; Taewoo Kim; Junseok Kwon

TL;DR

Real-world speech often contains both background noise and overlapping speakers, which motivates a unified solution for speech enhancement (SE) and speech separation (SS). The authors propose UniVoiceLite, a lightweight, unsupervised audio-visual Wasserstein autoencoder that uses lip motion and facial identity as visual priors and replaces KL regularization with a Wasserstein distance to stabilize the latent space. A single model performs both enhancement and separation without paired noisy-clean data. Aided by an MCEM-based Wiener post-processing step, it shows strong SDR/STOI performance and generalizes across noisy and multi-speaker scenarios with only 2.3M parameters, making it an efficient option for speech processing in realistic environments. In the paper's notation, $\boldsymbol{z}_n$ and the Wasserstein distance $\mathcal{W}$ are central to the latent-space design and its regularization.

Abstract

Speech Enhancement (SE) and Speech Separation (SS) have traditionally been treated as distinct tasks in speech processing. However, real-world audio often involves both background noise and overlapping speakers, motivating the need for a unified solution. While recent approaches have attempted to integrate SE and SS within multi-stage architectures, they typically involve complex, parameter-heavy models and rely on supervised training, which limits scalability and generalization. In this work, we propose UniVoiceLite, a lightweight and unsupervised audio-visual framework that unifies SE and SS within a single model. UniVoiceLite leverages lip motion and facial identity cues to guide speech extraction and employs Wasserstein distance regularization to stabilize the latent space without requiring paired noisy-clean data. Experimental results demonstrate that UniVoiceLite achieves strong performance in both noisy and multi-speaker scenarios, combining efficiency with robust generalization. The source code is available at https://github.com/jisoo-o/UniVoiceLite.
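
To give some intuition for why a Wasserstein regularizer might stabilize a latent space where a KL term does not, the sketch below compares the two penalties in a setting where both have closed forms. It assumes a diagonal Gaussian posterior and a standard normal prior; this is an illustrative assumption, not the formulation used in UniVoiceLite, whose exact regularizer is not specified here. The function names and the sample values of `mu` and `sigma` are hypothetical.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(
        m * m + s * s - 1.0 - 2.0 * math.log(s) for m, s in zip(mu, sigma)
    )

def w2_squared_to_standard_normal(mu, sigma):
    """Squared 2-Wasserstein distance between N(mu, diag(sigma^2)) and N(0, I)."""
    return sum(m * m + (s - 1.0) ** 2 for m, s in zip(mu, sigma))

# Hypothetical encoder outputs for one latent frame (illustrative values only).
mu = [0.5, -0.2]
for sigma in ([1.0, 1.0], [1e-3, 1e-3]):
    kl = kl_to_standard_normal(mu, sigma)
    w2 = w2_squared_to_standard_normal(mu, sigma)
    print(f"sigma={sigma}  KL={kl:.3f}  W2^2={w2:.3f}")
```

As the posterior standard deviation shrinks toward zero, the KL term grows without bound because of its `-log(sigma)` component, while the squared Wasserstein distance stays bounded by the squared norm of `mu` plus the latent dimension. Under these assumptions, the Wasserstein penalty yields a better-behaved training signal when the encoder becomes very confident.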
