ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification

Yi Ma, Shuai Wang, Tianchi Liu, Haizhou Li

TL;DR

ExPO is an explainable phonetic trait-oriented network for speaker verification. It embeds a phonetic trait representation for each phone and compares enrollment and test utterances at the trait level. The model inserts trait layers into an ECAPA-TDNN backbone, uses a wav2vec2-based phone recognizer to segment utterances into phones from a set of $I=40$, and derives an utterance embedding of dimension $D_2$ from a trait-pooled representation. Training combines a standard Additive Angular Margin loss with a trait verification loss $\mathcal{L}_{veri}$ and a trait center loss $\mathcal{L}_{center}$, forming $\mathcal{L}_{all} = \mathcal{L}_{AAM} + \mathcal{L}_{veri} + \mathcal{L}_{center}$, to encourage both accuracy and explainability via a phonetic-trait similarity vector $\mathbf{s}$ and an evidence score. Experiments on VoxCeleb2, Vox1-O/E, SITW, and Librispeech show that ExPO produces interpretable decision explanations consistent with the final score while maintaining competitive verification performance. An ablation shows that both trait losses improve explainability, and discriminability analyses reveal meaningful phonetic trait contributions, including from non-verbal segments. Code for ExPO is available at https://github.com/mmmmayi/ExPO.
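
To make the trait-level comparison concrete, here is a minimal sketch of how a phonetic-trait similarity vector $\mathbf{s}$ could be computed, taking $s(i)$ to be the cosine similarity between the $i$-th phonetic traits of the enrollment and test utterances, as the caption of Figure 3 states. The function names, the use of `None` to mark a phone that does not occur in an utterance, and the toy vectors are assumptions made for illustration; this is not the authors' implementation.

```python
import math
from typing import List, Optional, Sequence

Trait = Optional[Sequence[float]]  # None marks a phone absent from the utterance (assumed convention)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # degenerate trait; treated as "no similarity" in this sketch
    return dot / (norm_u * norm_v)


def trait_similarity_vector(enroll: List[Trait], test: List[Trait]) -> List[Optional[float]]:
    """Return s, where s[i] compares the i-th phonetic trait of both utterances.

    s[i] is None when the i-th phone is missing from either utterance.
    """
    if len(enroll) != len(test):
        raise ValueError("both utterances must cover the same phone set")
    return [
        None if e is None or t is None else cosine(e, t)
        for e, t in zip(enroll, test)
    ]


# Toy example with 3 phones and 2-dimensional traits (the TL;DR uses I = 40).
enroll_traits = [[1.0, 0.0], [0.0, 2.0], None]
test_traits = [[2.0, 0.0], [3.0, 0.0], [1.0, 1.0]]
print(trait_similarity_vector(enroll_traits, test_traits))  # [1.0, 0.0, None]
```

Entries left as `None` play the role of the black bars in Figure 3, which mark phones absent from the utterances. How the per-phone similarities are turned into the evidence score is not specified in this summary, so the sketch stops at $\mathbf{s}$.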

Abstract

In speaker verification, we use computational methods to verify whether an utterance matches the identity of an enrolled speaker. This task is similar to the manual task of forensic voice comparison, where linguistic analysis is combined with auditory measurements to compare and evaluate voice samples. Despite much success, we have yet to develop a speaker verification system that offers explainable results comparable to those from manual forensic voice comparison. In this paper, we propose a novel approach, the Explainable Phonetic Trait-Oriented (ExPO) network, which introduces the speaker's phonetic traits. These describe the speaker's characteristics at the phonetic level, resembling what forensic comparison does. ExPO not only generates utterance-level speaker embeddings but also allows for fine-grained analysis and visualization of phonetic traits, offering an explainable speaker verification process. Furthermore, we investigate phonetic traits from within-speaker and between-speaker variation perspectives to determine which trait is most effective for speaker verification, marking an important step towards explainable speaker verification. Our code is available at https://github.com/mmmmayi/ExPO.

Paper Structure (15 sections, 6 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Illustration of explainability in manual voice comparison and in a neural speaker verification system. (a) A manual voice comparison process that is explainable; (b) the proposed ExPO model, which performs speaker verification in a way similar to manual voice comparison.
  • Figure 2: Block diagram of the proposed ExPO system. (a) ExPO introduces trait layers between frame layers and utterance layers of a standard ECAPA-TDNN network. During training, data are sampled pairwise as inputs, with the same shape indicating utterances from the same speaker. At the inference stage, the system generates the final score and phonetic similarity vector from the speaker embedding and phonetic traits, respectively. (b) The process in a trait embedding layer.
  • Figure 3: Visualization of the evidence provided by ExPO for two trials in Vox1-O. (a) and (d) are the spectrograms of the enrollment and test utterances, where phone boundaries are marked by dotted lines. (b) and (e) are the phonetic trait similarity vectors $\mathbf{s}$ for (a) and (d), respectively. $s(i)$ is the cosine similarity between the $i$-th phonetic traits of the enrollment and test utterances. The black bars indicate phones that are absent from the utterances. (c) and (f) are visual comparisons of the spectrum and the similarity of the phonetic traits.
  • Figure 4: The F-ratio of phonetic traits extracted from different phones. The phones are ordered from highest to lowest by their mean value across all test sets.
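
The F-ratio in Figure 4 ties in with the abstract's within-speaker and between-speaker analysis. This summary does not give the paper's exact formula, so the sketch below uses one common definition as an assumption: the variance of speaker means around the global mean, divided by the average variance of samples around their own speaker mean. The function names and toy data are hypothetical and not taken from the ExPO code.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mean_vec(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sq_dist(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def f_ratio(traits_by_speaker: Dict[str, List[Vector]]) -> float:
    """Between-speaker variance over within-speaker variance for one phone.

    traits_by_speaker maps a speaker id to that speaker's trait vectors
    for a single phone (assumed layout for this sketch).
    """
    speaker_means = {spk: mean_vec(vs) for spk, vs in traits_by_speaker.items()}
    global_mean = mean_vec(list(speaker_means.values()))

    # Spread of speaker means around the global mean.
    between = sum(sq_dist(m, global_mean) for m in speaker_means.values()) / len(speaker_means)

    # Spread of each sample around its own speaker mean.
    n_samples = sum(len(vs) for vs in traits_by_speaker.values())
    within = sum(
        sq_dist(v, speaker_means[spk])
        for spk, vs in traits_by_speaker.items()
        for v in vs
    ) / n_samples

    if within == 0.0:
        raise ValueError("within-speaker variance is zero; F-ratio is undefined")
    return between / within


# Two toy speakers whose traits for one phone are well separated.
toy = {
    "spk_a": [[0.0, 0.0], [0.0, 2.0]],
    "spk_b": [[4.0, 0.0], [4.0, 2.0]],
}
print(f_ratio(toy))  # 4.0
```

Under this definition, a higher value means that traits for the phone vary more across speakers than within a speaker, which is the sense in which a phone would be more useful for telling speakers apart.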