Cosine Scoring with Uncertainty for Neural Speaker Embedding
Qiongqiong Wang, Kong Aik Lee
TL;DR
This work addresses uncertainty in neural speaker embeddings and its effect on cosine scoring. It proposes uncertainty-aware cosine scoring (UP-Cos), which propagates embedding uncertainty, captured by the posterior covariance $\mathbf{\Sigma}_U$, into the back-end scoring through a Mahalanobis-like normalization; four variants of UP-Cos are explored. Experiments on VoxCeleb1-O/H and SITW show that UP-Cos outperforms conventional cosine scoring and performs competitively with, or better than, uncertainty-propagated PLDA at lower computational cost, with average reductions of about 8.5% in EER and 9.8% in minDCF compared to the cosine baseline. The results demonstrate the practical value of incorporating embedding uncertainty into back-end scoring for robust speaker recognition in real-world conditions.
Abstract
Uncertainty modeling in speaker representation aims to learn the variability present in speech utterances. While conventional cosine scoring is computationally efficient and prevalent in speaker recognition, it cannot handle uncertainty. To address this limitation, this paper proposes an approach that estimates uncertainty at the speaker embedding front-end and propagates it to the cosine scoring back-end. Experiments on the VoxCeleb and SITW datasets confirm the efficacy of the proposed method in handling uncertainty arising from embedding estimation: it achieves average reductions of 8.5% in EER and 9.8% in minDCF compared to conventional cosine similarity. It is also computationally efficient in practice.
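To make the idea of propagating embedding uncertainty into cosine scoring more concrete, the following is a minimal illustrative sketch, not the paper's UP-Cos formulation. It assumes diagonal covariances and a hypothetical weighting $(\mathbf{I} + \mathbf{\Sigma}_1 + \mathbf{\Sigma}_2)^{-1}$ applied inside the inner product and both norms; the function names and the specific weighting are assumptions chosen only to show how a Mahalanobis-like normalization can down-weight unreliable dimensions.

```python
import numpy as np


def cosine_score(x1, x2):
    """Conventional cosine similarity between two embeddings."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return float(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))


def uncertainty_weighted_cosine(x1, var1, x2, var2):
    """Hypothetical uncertainty-weighted cosine score (illustration only).

    Assumptions (not taken from the paper):
      * var1 and var2 are the diagonals of the embedding covariances;
      * each dimension is weighted by 1 / (1 + var1 + var2), i.e. a
        Mahalanobis-like metric with a diagonal matrix.
    With zero variances this reduces to the conventional cosine score.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    w = 1.0 / (1.0 + np.asarray(var1, dtype=float) + np.asarray(var2, dtype=float))
    num = np.sum(w * x1 * x2)
    den = np.sqrt(np.sum(w * x1 * x1)) * np.sqrt(np.sum(w * x2 * x2))
    return float(num / den)


# Toy example: the two embeddings disagree only in the third dimension,
# which is marked as highly uncertain for the first embedding.
x1 = np.array([1.0, 0.0, 1.0])
x2 = np.array([1.0, 0.0, -1.0])
zeros = np.zeros(3)
var1 = np.array([0.0, 0.0, 99.0])

plain = cosine_score(x1, x2)
weighted = uncertainty_weighted_cosine(x1, var1, x2, zeros)
```

In this toy case the conventional score is dominated equally by all dimensions, whereas the weighted score largely ignores the uncertain third dimension; the actual UP-Cos variants may differ in how the covariance enters the score.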
