PDAF: A Phonetic Debiasing Attention Framework For Speaker Verification

Massa Baali; Abdulhamid Aldoobi; Hira Dhamyal; Rita Singh; Bhiksha Raj

TL;DR

This paper introduces a novel Phoneme-Debiasing Attention Framework (PDAF) that integrates with existing attention frameworks to mitigate biases caused by phonetic dominance, paving the way for more accurate and reliable identity authentication through voice.

Abstract

Speaker verification systems are crucial for authenticating identity through voice. Traditionally, these systems compare feature vectors and overlook the content of the speech. This paper challenges that practice by highlighting phonetic dominance, a measure of the frequency or duration of phonemes, as a crucial cue in speaker verification. A novel Phoneme Debiasing Attention Framework (PDAF) is introduced that integrates with existing attention frameworks to mitigate biases caused by phonetic dominance. PDAF adjusts the weight assigned to each phoneme, which in turn influences feature extraction and allows for a more nuanced analysis of speech. This approach paves the way for more accurate and reliable identity authentication through voice. Furthermore, by employing various weighting strategies, we evaluate the influence of phonetic features on the efficacy of the speaker verification system.
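The abstract does not say how the per-phoneme weights are computed or where they enter the attention mechanism. As a minimal sketch of the general idea only, the example below assumes frame-level features, one attention score per frame, and a phoneme label per frame (as an aligner might provide). It rescales softmax attention by a per-phoneme weight and then pools with a weighted mean and standard deviation. The function names, the inverse-frame-count weighting, and the toy data are all hypothetical and are not taken from the paper.

```python
import numpy as np


def phoneme_debias_weights(phoneme_ids, num_phonemes):
    """Hypothetical weighting strategy: inverse frame count per phoneme,
    so phonemes that dominate the utterance are down-weighted."""
    counts = np.bincount(phoneme_ids, minlength=num_phonemes).astype(float)
    weights = np.zeros(num_phonemes)
    present = counts > 0
    weights[present] = 1.0 / counts[present]
    return weights


def debiased_attentive_stats_pooling(features, scores, phoneme_ids, weights):
    """Pool frame features (T, D) into a single vector of size 2 * D.

    scores:      (T,) unnormalized attention scores, one per frame
    phoneme_ids: (T,) integer phoneme label of each frame (assumed given)
    weights:     (num_phonemes,) non-negative weight per phoneme
    """
    # Numerically stable softmax over frames.
    shifted = scores - scores.max()
    attn = np.exp(shifted)
    attn /= attn.sum()

    # Rescale each frame's attention by the weight of its phoneme, then renormalize.
    attn = attn * weights[phoneme_ids]
    total = attn.sum()
    if total <= 0:
        raise ValueError("all frames received zero weight")
    attn /= total

    # Attention-weighted mean and standard deviation over frames.
    mean = attn @ features
    var = attn @ (features - mean) ** 2
    std = np.sqrt(np.maximum(var, 1e-12))
    return np.concatenate([mean, std]), attn


# Toy usage: 6 frames, 4 of which belong to the same (dominant) phoneme.
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(6, 3))
toy_scores = np.zeros(6)                      # uniform attention before debiasing
toy_phonemes = np.array([0, 0, 0, 0, 1, 2])   # phoneme 0 dominates
toy_weights = phoneme_debias_weights(toy_phonemes, num_phonemes=3)
embedding, attention = debiased_attentive_stats_pooling(
    toy_features, toy_scores, toy_phonemes, toy_weights
)
```

With uniform scores and inverse-count weights, each phoneme present in the toy utterance ends up with the same total attention mass regardless of how many frames it spans. Other weighting strategies could be swapped in by changing `phoneme_debias_weights` alone.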

Paper Structure (16 sections, 6 equations, 3 figures, 4 tables)

Introduction
Background and related work
Phoneme-Debiasing Attention Framework
Implementing the model
Phonetic Aligner
Multi-Head Self-Attention
Estimating $\hat{\mathcal{P}}$
Attention Masking
Masking Individual Phonemes
Self-Attentive Pooling with Mean and STD
Speaker Embeddings
Experiments
Datasets and Preprocessing
Experimental Setup
Results and Analysis
...and 1 more sections

Figures (3)

Figure 1: Statistical dependencies in the generation of a meaningful speech signal. The lexical content $L$ determines the phonetic structure $P$, which in turn determines the acoustics $A$. Thus, any produced signal represents a draw of all three variables. Content-agnostic verification systems, however, consider only the acoustics, the final variable $A$ (in the dotted circle), ignoring its conditioning on the upstream variables $L$ and $P$.
Figure 2: The proposed PDAF phonetic integration for speaker recognition.
Figure 3: Mean and STD of the change in EER when a phoneme is masked, computed over 6 models.
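
The dependency chain described in the Figure 1 caption can be written out as a factorization. This is a sketch that assumes the chain $L \to P \to A$ is a Markov chain over discrete $L$ and $P$, which the caption suggests but does not state explicitly:

$$
p(L, P, A) = p(L)\, p(P \mid L)\, p(A \mid P)
$$

Under that assumption, a content-agnostic system that looks only at $A$ effectively works with the marginal

$$
p(A) = \sum_{L} \sum_{P} p(L)\, p(P \mid L)\, p(A \mid P),
$$

so differences in phonetic content between two recordings are folded into the acoustics rather than accounted for separately.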
