Cosine Scoring with Uncertainty for Neural Speaker Embedding

Qiongqiong Wang, Kong Aik Lee

TL;DR

This work tackles the problem of uncertainty in neural speaker embeddings and its impact on cosine scoring. It proposes uncertainty-aware cosine scoring (UP-Cos), which propagates embedding uncertainty, captured by the posterior covariance $\mathbf{\Sigma}_U$, into the back-end scoring via a Mahalanobis-like normalization; four variants of UP-Cos are explored. Experiments on VoxCeleb1-O/H and SITW show that UP-Cos outperforms conventional cosine scoring and performs competitively with, or better than, uncertainty-propagated PLDA at lower computational cost, with average reductions of about 8.5% in EER and 9.8% in minDCF over the baseline cosine approach. The results demonstrate the practical value of incorporating embedding uncertainty into back-end scoring for robust speaker recognition in real-world conditions.

Abstract

Uncertainty modeling in speaker representation aims to learn the variability present in speech utterances. While conventional cosine scoring is computationally efficient and prevalent in speaker recognition, it cannot handle uncertainty. To address this challenge, this paper proposes an approach for estimating uncertainty at the speaker embedding front-end and propagating it to the cosine scoring back-end. Experiments conducted on the VoxCeleb and SITW datasets confirmed the efficacy of the proposed method in handling uncertainty arising from embedding estimation. It achieved average reductions of 8.5% in EER and 9.8% in minDCF compared to conventional cosine similarity. It is also computationally efficient in practice.
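
To make the idea of propagating embedding uncertainty into cosine scoring concrete, the following is a minimal sketch and not the authors' formulation: the exact UP-Cos equations and its four variants are not reproduced in this summary. It assumes a diagonal posterior variance per embedding and a simple per-dimension scaling by $(1 + \sigma^2)^{-1/2}$ before the usual cosine. The function names `cosine_score` and `uncertainty_scaled_cosine` are hypothetical.

```python
import math


def cosine_score(x_e, x_t):
    """Conventional cosine similarity between enrollment and test embeddings."""
    dot = sum(a * b for a, b in zip(x_e, x_t))
    norm_e = math.sqrt(sum(a * a for a in x_e))
    norm_t = math.sqrt(sum(b * b for b in x_t))
    return dot / (norm_e * norm_t)


def uncertainty_scaled_cosine(x_e, var_e, x_t, var_t):
    """Illustrative assumption only, not the paper's UP-Cos.

    Each dimension is divided by sqrt(1 + variance), so dimensions with a
    large posterior variance contribute less to the final cosine score.
    """
    w_e = [a / math.sqrt(1.0 + v) for a, v in zip(x_e, var_e)]
    w_t = [b / math.sqrt(1.0 + v) for b, v in zip(x_t, var_t)]
    return cosine_score(w_e, w_t)


if __name__ == "__main__":
    x_e, x_t = [1.0, 1.0], [1.0, 0.0]
    # With zero variance the sketch reduces to plain cosine scoring.
    print(cosine_score(x_e, x_t))
    print(uncertainty_scaled_cosine(x_e, [0.0, 0.0], x_t, [0.0, 0.0]))
    # A high variance on the second enrollment dimension down-weights it.
    print(uncertainty_scaled_cosine(x_e, [0.0, 3.0], x_t, [0.0, 0.0]))
```

In this toy setting, zero variance recovers the conventional score, while a large variance on a dimension reduces that dimension's influence. This is the qualitative behaviour the TL;DR attributes to a Mahalanobis-like normalization.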

Paper Structure (9 sections, 14 equations, 4 figures, 2 tables)

This paper contains 9 sections, 14 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: A speaker-embedding neural network with uncertainty propagation. "BN" and "FC" are shorthand for batch normalization and a fully-connected layer, respectively. The Gaussian distributions illustrate that the posterior distribution of an embedding differs from the prior distribution.
  • Figure 2: Boxplots of the diagonal values of the average uncertainty estimation, and the within- and between-speaker covariance matrices estimated on the VoxCeleb2 development set. The off-diagonal values are disregarded.
  • Figure 3: Distribution of the product $\alpha_\text{e} \alpha_\text{t}$ when using the four methods of UP-Cos shown in Table 1. The investigation is done on the VoxCeleb1-O test dataset.
  • Figure 4: Scatter plot of average posterior covariance versus utterance duration for Vox1-H and SITW-eval datasets.