Explainable Attribute-Based Speaker Verification

Xiaoliang Wu, Chau Luu, Peter Bell, Ajitha Rajan

TL;DR

This work tackles trust in speaker verification (SV) by integrating human-interpretable attributes into a two-stage, explainable framework inspired by the Concept Bottleneck Model. Stage-1 trains attribute classifiers (using both embeddings from Xvector/ECAPA and MFCC-based AC) to predict gender, nationality, age, and profession; Stage-2 turns the resulting attribute predictions into pairwise similarity vectors and trains an SV predictor on them. The best configuration, which uses softmax-label similarity, reaches an EER of around $0.18$ on VoxCeleb1, approaching the $0.15$ obtained with ground-truth attributes (that is, when all attributes are correct), but the approach sacrifices some SV accuracy because of its limited attribute set. Overall, the paper demonstrates the feasibility of transparent SV and points to attribute expansion as a way to improve performance while maintaining interpretability.

Abstract

This paper proposes a fully explainable approach to speaker verification (SV), a task that fundamentally relies on individual speaker characteristics. The opaque use of speaker attributes in current SV systems raises concerns of trust. Addressing this, we propose an attribute-based explainable SV system that identifies speakers by comparing personal attributes such as gender, nationality, and age extracted automatically from voice recordings. We believe this approach better aligns with human reasoning, making it more understandable than traditional methods. Evaluated on the VoxCeleb1 test set, the best performance of our system is comparable with the ground truth established when using all correct attributes, proving its efficacy. Whilst our approach sacrifices some performance compared to non-explainable methods, we believe that it moves us closer to the goal of transparent, interpretable AI and lays the groundwork for future enhancements through attribute expansion.

Paper Structure

This paper contains 12 sections, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Stage-2 of our SV system is shown. Pretrained classifiers from stage-1 are used to extract attribute labels from pairs of audio samples. These attributes are then fed into a computation block that calculates a similarity vector for the corresponding pair of audio samples using hard or softmax similarity. The similarity vector is then used to train a stage-2 machine learning model, shown in red, which is the only component trained during this stage. The output is the final similarity score, which quantifies the likelihood that the two audio inputs are from the same speaker.
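
The similarity-vector computation described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the class counts per attribute, and the toy logits are all hypothetical. In particular, the summary does not define "softmax similarity"; the sketch assumes it is the dot product of the two softmax probability vectors for an attribute, while "hard similarity" is assumed to be a 0/1 match on the predicted label.

```python
import numpy as np

# Attribute names taken from the summary; class counts below are hypothetical.
ATTRIBUTES = ("gender", "nationality", "age", "profession")


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def hard_similarity(probs_a, probs_b):
    """One entry per attribute: 1.0 if the predicted labels match, else 0.0."""
    return np.array([
        float(np.argmax(probs_a[name]) == np.argmax(probs_b[name]))
        for name in ATTRIBUTES
    ])


def soft_similarity(probs_a, probs_b):
    """One entry per attribute: dot product of the two softmax vectors.

    ASSUMPTION: the source does not specify the formula; the dot product is
    used here because it equals the probability that both classifiers'
    sampled labels agree.
    """
    return np.array([
        float(np.dot(probs_a[name], probs_b[name]))
        for name in ATTRIBUTES
    ])


# Toy stage-1 outputs for two utterances (made-up logits).
utt_a = {
    "gender": softmax([2.0, 0.0]),
    "nationality": softmax([0.0, 1.0, 3.0]),
    "age": softmax([1.0, 0.0, 0.0]),
    "profession": softmax([0.0, 0.0, 2.0, 0.0]),
}
utt_b = {
    "gender": softmax([1.5, 0.2]),
    "nationality": softmax([2.0, 0.5, 0.1]),
    "age": softmax([0.8, 0.1, 0.0]),
    "profession": softmax([0.1, 0.0, 1.7, 0.2]),
}

print("hard:", hard_similarity(utt_a, utt_b))
print("soft:", soft_similarity(utt_a, utt_b))
```

Either vector has one entry per attribute, so each input feature of the stage-2 model corresponds to a named attribute. In the setup the caption describes, such vectors, computed for labelled same-speaker and different-speaker pairs, would serve as the training inputs of the stage-2 model, whose output is the final similarity score.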