A Joint Noise Disentanglement and Adversarial Training Framework for Robust Speaker Verification

Xujiang Xing; Mingxing Xu; Thomas Fang Zheng

TL;DR

The paper addresses the degradation of automatic speaker verification (ASV) under noise by introducing a noise-disentanglement adversarial learning framework that learns noise-independent speaker embeddings. It combines a disentanglement module (a speaker encoder $E_s$, a speaker-irrelevant encoder $E_i$, and a reconstruction module $D$), a feature-robust loss, and adversarial training via a gradient reversal layer that discourages the speaker encoder from encoding acoustic-condition information. The model is trained with the joint objective $L = L_{rec} + L_{fr} + L_{cls} - \lambda L_{adv}$, which combines reconstruction, feature-robust, classification, and adversarial terms, with $\lambda$ weighting the adversarial term. On VoxCeleb1 it reduces the equal error rate (EER) under both seen and unseen noise conditions, outperforming NDML and a joint-training baseline. The results suggest that a noise-agnostic embedding space makes speaker verification more robust in real-world environments.
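Since the summary reports results as EER, a minimal sketch of how an equal error rate can be computed from trial scores may help. This is not the authors' evaluation code: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy trial list are assumptions made for illustration only.

```python
from typing import Sequence, Tuple


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (eer, threshold) for a list of verification trials.

    scores: similarity scores, where higher means "same speaker".
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer, best_thr = float("inf"), 1.0, 0.0
    # Sweep every observed score as a candidate threshold (accept if score >= thr).
    for thr in sorted(set(scores)):
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < thr)
        far = false_accepts / n_nontarget  # false acceptance rate
        frr = false_rejects / n_target     # false rejection rate
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest to each other.
        if gap < best_gap:
            best_gap, eer, best_thr = gap, (far + frr) / 2.0, thr
    return eer, best_thr


# Toy trials (made up): four target scores followed by four non-target scores.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))  # -> (0.25, 0.6)
```

The sweep is quadratic in the number of trials, which is fine for illustration; a full VoxCeleb1 trial list would normally call for a sorted cumulative-count implementation instead.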

Abstract

Automatic Speaker Verification (ASV) suffers from performance degradation in noisy conditions. To address this issue, we propose a novel adversarial learning framework that incorporates noise disentanglement to establish a noise-independent, speaker-invariant embedding space. Specifically, the disentanglement module includes two encoders that separate speaker-related and speaker-irrelevant information, respectively. The reconstruction module serves as a regularization term to constrain the noise. A feature-robust loss is also used to supervise the speaker encoder so that it learns noise-independent speaker embeddings without losing speaker information. In addition, adversarial training is introduced to discourage the speaker encoder from encoding acoustic condition information, in order to achieve a speaker-invariant embedding space. Experiments on VoxCeleb1 indicate that the proposed method improves the performance of the speaker verification system under both clean and noisy conditions.
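The adversarial component is commonly realized with a gradient reversal layer, as the TL;DR notes. The sketch below is a framework-free illustration of that idea only, not the authors' implementation: the class name `GradientReversal`, the attribute `lam` (standing in for the weight $\lambda$), and the explicit `forward`/`backward` methods are hypothetical. In practice the same behavior would be expressed through an autograd framework's custom-gradient mechanism.

```python
from typing import List


class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam: float = 1.0) -> None:
        # Hypothetical name for the adversarial trade-off weight (lambda).
        self.lam = lam

    def forward(self, x: List[float]) -> List[float]:
        # Speaker-encoder features pass through unchanged to the
        # acoustic-condition classifier.
        return list(x)

    def backward(self, grad_output: List[float]) -> List[float]:
        # The gradient sent back to the speaker encoder is sign-flipped and
        # scaled, so the encoder is pushed to make the condition harder to predict.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)
features = grl.forward([1.0, -2.0])      # unchanged features
encoder_grad = grl.backward([0.2, -0.4])  # reversed, scaled gradient
print(features, encoder_grad)
```

The classifier placed after the layer still receives ordinary gradients and learns to predict the acoustic condition, while the encoder placed before it receives the reversed gradient; this is one way to read the $-\lambda L_{adv}$ term in the joint objective.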

Paper Structure (13 sections, 6 equations, 2 figures, 2 tables)

Introduction
Related Work
TDNN for deep speaker embedding
NDML-based method
Proposed Methods
Noise disentanglement
Adversarial training
Experiments
Datasets
Implementation details
Results
Ablation studies
Conclusion

Figures (2)

Figure 1: The architecture of noise-disentanglement adversarial training. Identical symbols correspond to the same speaker. The data depicted in blue and red denote the original and the augmented datasets, respectively.
Figure 2: t-SNE visualization of speaker embeddings under (a, b) seen noise and (c, d) unseen noise conditions.
