SEED: Speaker Embedding Enhancement Diffusion Model

KiHyun Nam, Jungwoo Heo, Jee-weon Jung, Gangin Park, Chaeyoung Jung, Ha-Jin Yu, Joon Son Chung

TL;DR

SEED addresses environmental mismatch in speaker recognition by applying a diffusion model directly to speaker embeddings, aligning clean and noisy representations without needing speaker labels or altering existing pipelines. It trains on clean data with multi-pair augmentation and uses forward diffusion on both clean and noisy embeddings, with a cross-embedding reverse process that refines noisy outputs toward the clean embedding. The method achieves strong improvements in mismatched conditions (up to ~19.6% relative gain) while maintaining performance on conventional data, using a lightweight diffusion backbone and single-step sampling. This embedding-level diffusion offers a practical, data-efficient path to robust real-world speaker recognition and can be deployed on top of diverse pre-trained embedding extractors. Future work aims to explicitly model the domain-mismatch gap to improve stability under extreme conditions.

Abstract

A primary challenge when deploying speaker recognition systems in real-world applications is performance degradation caused by environmental mismatch. We propose a diffusion-based method that takes speaker embeddings extracted from a pre-trained speaker recognition model and generates refined embeddings. For training, our approach progressively adds Gaussian noise to both clean and noisy speaker embeddings, extracted from clean and noisy speech respectively, via the forward process of a diffusion model, and then reconstructs them into clean embeddings in the reverse process. At inference, all embeddings are regenerated via the diffusion process. Our method requires neither speaker labels nor any modification to the existing speaker recognition pipeline. Experiments on evaluation sets simulating environmental mismatch scenarios show that our method can improve recognition accuracy by up to 19.6% over baseline models while retaining performance on conventional scenarios. Our code is available at https://github.com/kaistmm/seed-pytorch
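
The training setup described in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' repository: it assumes a standard DDPM-style closed-form forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with a linear beta schedule, and the names `make_alpha_bar`, `forward_diffuse`, and `training_pairs`, the 192-dimensional embeddings, and the 0.3 perturbation scale are all hypothetical choices made for illustration. The point it shows is that both the clean and the noisy embedding are diffused, and both are paired with the clean embedding as the reconstruction target.

```python
import math
import random


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule.

    The schedule and its endpoints are assumptions, not values from the paper.
    """
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t from x_0 in one shot using the closed-form forward process."""
    a = alpha_bar[t]
    return [
        math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for v in x0
    ]


def training_pairs(clean_emb, noisy_emb, t, alpha_bar, rng):
    """Build (diffused input, target) pairs; the target is always the clean embedding."""
    return [
        (forward_diffuse(clean_emb, t, alpha_bar, rng), clean_emb),
        (forward_diffuse(noisy_emb, t, alpha_bar, rng), clean_emb),
    ]


# Toy usage with random stand-ins for extractor outputs (hypothetical data).
rng = random.Random(0)
alpha_bar = make_alpha_bar()
clean = [rng.gauss(0.0, 1.0) for _ in range(192)]
noisy = [c + 0.3 * rng.gauss(0.0, 1.0) for c in clean]
pairs = training_pairs(clean, noisy, 500, alpha_bar, rng)
```

In an actual system the two embeddings would come from a pre-trained speaker embedding extractor applied to clean and augmented speech, and a denoising network would be trained on such pairs; neither component is modeled here.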

Paper Structure

This paper contains 19 sections, 12 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Illustration of the Speaker Embedding Enhancement Diffusion (SEED) model. (a) explains the concept of our diffusion mechanism. (b) shows the whole training process of SEED. (c) illustrates the architecture of SEED.