Xi+: Uncertainty Supervision for Robust Speaker Embedding

Junjie Li, Kong Aik Lee, Duc-Tuan Truong, Tianchi Liu, Man-Wai Mak

TL;DR

This paper addresses the sensitivity of speaker embeddings to frame-level variability such as emotion and language. It proposes xi+, which adds a Transformer-based temporal uncertainty module, a Stochastic Variance Loss for explicit uncertainty supervision, and an uncertainty-aware cosine scoring backend to better exploit frame reliability. The approach yields relative gains of about 10% on VoxCeleb1-O and 11% on the NIST SRE 2024 evaluation set, indicating improved robustness under challenging conditions. These methods are of practical use for speaker verification across variable channels and speaking styles.

Abstract

There are various factors that can influence the performance of speaker recognition systems, such as emotion, language and other speaker-related or context-related variations. Since individual speech frames do not contribute equally to the utterance-level representation, it is essential to estimate the importance or reliability of each frame. The xi-vector model addresses this by assigning different weights to frames based on uncertainty estimation. However, its uncertainty estimation model is implicitly trained through classification loss alone and does not consider the temporal relationships between frames, which may lead to suboptimal supervision. In this paper, we propose an improved architecture, xi+. Compared to xi-vector, xi+ incorporates a temporal attention module to capture frame-level uncertainty in a context-aware manner. In addition, we introduce a novel loss function, Stochastic Variance Loss, which explicitly supervises the learning of uncertainty. Results demonstrate consistent performance improvements of about 10% on the VoxCeleb1-O set and 11% on the NIST SRE 2024 evaluation set.

Paper Structure

This paper contains 15 sections, 12 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: The architecture of xi+, an extended version of the xi-vector model [lee2021xi]. The modules shown in gray correspond to components of the original xi-vector system, while the modules highlighted in other colors represent our proposed extensions. In the diagram, $\mathbf{z}_p$ denotes the prior mean vector, and $\mathbf{L}_p$ denotes the prior precision matrix. $\otimes$ and $\ominus$ denote element-wise multiplication and concatenation, respectively. BN$^\text{V}$ and FC3$^\text{V}$ constitute the variance-processing branch, which is structurally parallel to the mean branch. Specifically, they share parameters with BN2 and FC3 in the mean branch, ensuring consistency across the two branches. As shown in the distribution space, all frame-level Gaussian distributions are integrated to form an utterance-level Gaussian distribution, which serves as a compact representation of the speaker.
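
As a rough illustration of the integration step described in the Figure 1 caption, the sketch below fuses frame-level Gaussians with a Gaussian prior by precision weighting (the standard product-of-Gaussians rule). It assumes diagonal precisions stored as log-precision vectors. The function name `gaussian_pool` and its arguments are hypothetical and not taken from the paper, whose exact formulation may differ.

```python
import numpy as np


def gaussian_pool(z, log_prec, z_p, log_prec_p):
    """Fuse frame-level diagonal Gaussians with a Gaussian prior.

    Assumed shapes (illustrative only, not from the paper):
      z          (T, D) frame-level mean vectors
      log_prec   (T, D) frame-level log-precisions (diagonal)
      z_p        (D,)   prior mean vector
      log_prec_p (D,)   prior log-precision (diagonal)

    Returns the utterance-level mean and diagonal precision.
    """
    L = np.exp(log_prec)        # frame precisions, always positive
    L_p = np.exp(log_prec_p)    # prior precision, always positive

    # Precisions add up when independent Gaussian factors are multiplied.
    post_prec = L_p + L.sum(axis=0)

    # The fused mean is the precision-weighted average of all means,
    # so frames with low precision (high uncertainty) contribute little.
    post_mean = (L_p * z_p + (L * z).sum(axis=0)) / post_prec
    return post_mean, post_prec


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D = 5, 3                          # 5 frames, 3-dimensional embeddings
    z = rng.normal(size=(T, D))
    log_prec = rng.normal(size=(T, D))
    mean, prec = gaussian_pool(z, log_prec, np.zeros(D), np.zeros(D))
    print(mean.shape, prec.shape)
```

Working with log-precisions keeps every precision positive without an explicit constraint, which is one common design choice for this kind of pooling. A frame whose log-precision is very negative is effectively ignored in the fused mean.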