
Disentangled Representation Learning for Environment-agnostic Speaker Recognition

KiHyun Nam, Hee-Soo Heo, Jee-weon Jung, Joon Son Chung

TL;DR

The paper tackles environment-induced variability in speaker recognition by learning environment-agnostic speaker embeddings. It introduces an auto-encoder-based disentangler that splits embeddings into speaker and environment factors, guided by reconstruction losses, embedding swapping, and multiple discriminators, with gradient reversal and correlation regularisation to prevent leakage of environmental information. The framework is designed to work with any existing embedding extractor and is demonstrated with ResNet-34 and ECAPA-TDNN backbones, achieving improvements of up to 16% on environment-robust evaluation sets while generalising better than prior adversarial disentangled representation learning (DRL) methods. The approach offers a practical, plug-in solution for robust speaker verification in real-world, acoustically diverse conditions, and its code is open source for replication and further research.

Abstract

This work presents a framework based on feature disentanglement to learn speaker embeddings that are robust to environmental variations. Our framework utilises an auto-encoder as a disentangler, dividing the input speaker embedding into components related to the speaker and other residual information. We employ a group of objective functions to ensure that the auto-encoder's code representation, which is used as the refined embedding, condenses only the speaker characteristics. We show the versatility of our framework through its compatibility with any existing speaker embedding extractor, requiring no structural modifications or adaptations for integration. We validate the effectiveness of our framework by incorporating it into two widely used embedding extractors and conducting experiments across various benchmarks. The results show a performance improvement of up to 16%. Our code is available at https://github.com/kaistmm/voxceleb-disentangler
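One of the objectives named in the summary is a correlation regulariser that discourages the speaker part of the code from sharing information with the residual part. The NumPy sketch below is only an illustration of that idea: the function name `correlation_penalty`, the use of mean absolute Pearson correlation, and the toy dimensions are assumptions, not the authors' formulation.

```python
import numpy as np


def correlation_penalty(spk, env, eps=1e-8):
    """Mean absolute Pearson correlation over all (speaker dim, environment dim) pairs.

    spk: array of shape (batch, d_spk); env: array of shape (batch, d_env).
    This exact form is an assumption made for illustration.
    """
    # Centre each dimension over the batch.
    spk_c = spk - spk.mean(axis=0, keepdims=True)
    env_c = env - env.mean(axis=0, keepdims=True)
    # Scale each column to unit norm so the product below is a correlation.
    spk_n = spk_c / (np.linalg.norm(spk_c, axis=0, keepdims=True) + eps)
    env_n = env_c / (np.linalg.norm(env_c, axis=0, keepdims=True) + eps)
    corr = spk_n.T @ env_n  # shape (d_spk, d_env)
    return float(np.abs(corr).mean())


rng = np.random.default_rng(0)
spk = rng.standard_normal((256, 8))              # toy "speaker" codes
env_independent = rng.standard_normal((256, 4))  # unrelated to spk
# A "leaky" environment code that copies four speaker dimensions plus noise.
env_leaky = spk[:, :4] + 0.1 * rng.standard_normal((256, 4))

penalty_independent = correlation_penalty(spk, env_independent)
penalty_leaky = correlation_penalty(spk, env_leaky)
print(penalty_independent, penalty_leaky)
```

In this toy setting the penalty is larger when the environment code copies speaker dimensions, which is the behaviour a regulariser of this kind is meant to penalise during training.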

Paper Structure (15 sections, 4 equations, 1 figure, 1 table)

Figures (1)

  • Figure 1: Illustration of the proposed environment-disentangled representation learning framework. The auto-encoder encodes the speaker network's entangled speaker representation into a compact latent vector, which is then divided into distinct speaker and environment representation vectors. The orange box represents the set of objective functions that facilitate learning refined speaker and environment representations from the auto-encoder's bottleneck representation. Reconstruction training of the auto-encoder minimises the loss of vital speaker information during disentangled representation learning.
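
The data flow in Figure 1 (encode, split the bottleneck, decode) can be sketched with a single linear layer on each side. Everything concrete here is hypothetical: the dimensions, the random untrained weights, the `tanh` non-linearity, and the way environment codes are swapped between two inputs are illustrative assumptions, not details taken from the paper or its repository.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, SPK_DIM, ENV_DIM = 16, 6, 2  # hypothetical sizes

# Untrained, randomly initialised weights; a real model would learn these.
W_enc = rng.standard_normal((EMB_DIM, SPK_DIM + ENV_DIM)) * 0.1
W_dec = rng.standard_normal((SPK_DIM + ENV_DIM, EMB_DIM)) * 0.1


def encode(x):
    """Map entangled embeddings to a bottleneck code and split it in two."""
    code = np.tanh(x @ W_enc)
    return code[:, :SPK_DIM], code[:, SPK_DIM:]  # (speaker part, environment part)


def decode(spk, env):
    """Reconstruct an embedding from the concatenated speaker and environment parts."""
    return np.concatenate([spk, env], axis=1) @ W_dec


def reconstruction_loss(x, x_hat):
    """Mean squared error between the input embedding and its reconstruction."""
    return float(np.mean((x - x_hat) ** 2))


# Two toy batches standing in for embeddings from a frozen speaker network.
x_a = rng.standard_normal((4, EMB_DIM))
x_b = rng.standard_normal((4, EMB_DIM))

spk_a, env_a = encode(x_a)
spk_b, env_b = encode(x_b)

recon_a = decode(spk_a, env_a)    # plain reconstruction
swapped_a = decode(spk_a, env_b)  # speaker code of A with environment code of B

loss_plain = reconstruction_loss(x_a, recon_a)
loss_swap = reconstruction_loss(x_a, swapped_a)
print(loss_plain, loss_swap)
```

At inference time only the speaker part (`spk_a` here) would serve as the refined embedding; the decoder and the environment part exist to shape training.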