Phonetic Richness for Improved Automatic Speaker Verification

Nicholas Klein, Ganesh Sivaraman, Elie Khoury

TL;DR

This work introduces phonetic richness as a quality metric for automatic speaker verification (ASV), defining count-unique (CU) and weighted count-unique (WCU) measures derived from automatic speech recognition (ASR) transcripts. A logistic regression calibration framework incorporates CU, WCU, and net-speech to improve ASV scores, yielding a relative equal error rate (EER) improvement of up to 5.8% on VoxCeleb1 and notable gains for short utterances. Experiments show that CU and WCU correlate with speaker-matching scores and are particularly effective when lexical content is repetitive, and that the learned phoneme weights highlight informative phoneme classes (e.g., nasal consonants and certain affricates). The approach is complementary to net-speech and shows promise for robust short-utterance verification; future work includes language generalization and non-transcript-based richness measures.
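As a rough illustration of the two measures named above, the sketch below computes CU as the number of distinct phonemes in a transcript and WCU as a sum of per-phoneme weights over those distinct phonemes. This reading of WCU, the ARPAbet-style phoneme labels, the weight values, and the function names are all assumptions made for the example; this summary does not give the exact formulas or the learned weights.

```python
from collections import Counter

def phoneme_histogram(phonemes):
    """Histogram of phoneme occurrences in one utterance."""
    return Counter(phonemes)

def count_unique(phonemes):
    """CU: number of distinct phonemes that occur at least once."""
    return len(phoneme_histogram(phonemes))

def weighted_count_unique(phonemes, weights, default=0.0):
    """WCU (assumed form): sum of per-phoneme weights over distinct phonemes."""
    return sum(weights.get(p, default) for p in phoneme_histogram(phonemes))

# Hypothetical phoneme sequences with the same length but different variety.
repetitive = ["N", "OW", "N", "OW", "N", "OW"]   # one word repeated
varied = ["N", "AY", "N", "M", "EH", "N"]        # more distinct phonemes

# Hypothetical weights; in the paper such weights are learned from data.
toy_weights = {"N": 1.5, "M": 1.4, "AY": 0.8, "EH": 0.7, "OW": 0.9}

cu_repetitive = count_unique(repetitive)   # 2 distinct phonemes
cu_varied = count_unique(varied)           # 4 distinct phonemes
wcu_repetitive = weighted_count_unique(repetitive, toy_weights)
wcu_varied = weighted_count_unique(varied, toy_weights)
```

Both toy utterances contain six phonemes, so a length-based measure alone would treat them alike, while CU and WCU separate them.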

Abstract

When it comes to authentication in speaker verification systems, not all utterances are created equal. It is essential to estimate the quality of test utterances in order to account for varying acoustic conditions. This paper observes that, in addition to the net-speech duration of an utterance, phonetic richness is a key indicator of utterance quality and plays a significant role in accurate speaker verification. Several phonetic-histogram-based formulations of phonetic richness are explored using transcripts obtained from an automatic speech recognition system. The proposed phonetic richness measure is found to be positively correlated with voice authentication scores across evaluation benchmarks. Additionally, in combination with net-speech, the proposed measure helps calibrate speaker verification scores, obtaining a relative EER improvement of 5.8% on the VoxCeleb1 evaluation protocol. The proposed phonetic-richness-based calibration provides a greater benefit for short utterances with repeated words.
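The calibration idea can be sketched as a logistic regression that maps the raw ASV score plus quality measures to a calibrated log-odds value. The example below is a minimal sketch under stated assumptions: the trial data are made up, the feature scaling (CU divided by 10, log of net-speech seconds) and the plain gradient-descent fit are illustrative choices, and only CU is used as the richness feature. None of these details are taken from the paper.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def features(score, cu, net_speech_s):
    """Bias, raw ASV score, scaled CU, log net-speech (assumed scaling)."""
    return [1.0, score, cu / 10.0, math.log(net_speech_s)]

def log_odds(weights, x):
    """Calibrated score: weighted sum of the features."""
    return sum(w * xi for w, xi in zip(weights, x))

def log_loss(weights, rows, labels):
    """Mean binary cross-entropy of the calibrated scores."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = min(max(sigmoid(log_odds(weights, x)), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(rows)

def fit(rows, labels, lr=0.1, steps=2000):
    """Batch gradient descent on the mean log-loss, starting from zeros."""
    weights = [0.0] * len(rows[0])
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = sigmoid(log_odds(weights, x)) - y
            for j, xi in enumerate(x):
                grad[j] += err * xi / len(rows)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights

# Made-up trials: (ASV score, CU of test utterance, net-speech in s, label),
# where label 1 marks a same-speaker pair and 0 a different-speaker pair.
trials = [
    (0.62, 24, 6.0, 1), (0.48, 12, 2.5, 1), (0.28, 6, 1.2, 1), (0.55, 20, 5.0, 1),
    (0.30, 22, 5.5, 0), (0.10, 10, 2.0, 0), (0.12, 7, 1.4, 0), (0.05, 25, 6.5, 0),
]
rows = [features(s, cu, ns) for s, cu, ns, _ in trials]
labels = [t[3] for t in trials]

initial_loss = log_loss([0.0] * 4, rows, labels)  # ln 2 with all-zero weights
weights = fit(rows, labels)
final_loss = log_loss(weights, rows, labels)
```

In this toy set, a same-speaker trial with a short, phonetically poor utterance (score 0.28) scores below a different-speaker trial with a rich utterance (score 0.30), so no single threshold on the raw score separates the classes; the extra features give the model room to compensate.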

Paper Structure

This paper contains 11 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: t-SNE plot of ECAPA-TDNN speaker verification embeddings for utterances with similar net-speech but varying numbers of unique phonemes, spoken by APLAWD speakers 'a' through 'e'. Number markers indicate the number of unique phonemes in the utterance, while their color distinguishes speakers. Clusters of utterances with more phonemes are circled in red.
  • Figure 2: ASV score as a function of phonetic richness measures and net speech for Aplawd-Repetitive test utterances. Positive (blue) and negative (orange) pairs are plotted separately to observe class separation patterns, and Kendall's $\tau$ is computed for each class.
  • Figure 3: Distribution of learned phoneme-specific weights compared with the frequency at which each phoneme occurs in the data used for fitting the weights. A larger weight relative to a phoneme's frequency suggests that the phoneme carries more speaker-identifying information.
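
The caption of Figure 2 reports Kendall's $\tau$ between each richness measure and the ASV score, computed separately for positive and negative pairs. As a reminder of what that statistic measures, here is a minimal sketch of the tau-a variant, which ignores tie corrections. Which variant the paper uses is not stated in this summary, and the values below are made up.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x2 - x1) * (y2 - y1)
        if sign > 0:
            concordant += 1      # both values move in the same direction
        elif sign < 0:
            discordant += 1      # values move in opposite directions
        # sign == 0 means a tie in x or y; tau-a counts it as neither
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Made-up CU values and ASV scores for four same-speaker trials.
cu_values = [4, 9, 15, 22]
asv_scores = [0.31, 0.45, 0.40, 0.58]

# Five concordant pairs and one discordant pair out of six: tau = 4/6.
tau = kendall_tau_a(cu_values, asv_scores)
```

A value near 1 means the score tends to rise with the richness measure, a value near -1 means it tends to fall, and a value near 0 means there is no monotonic association.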