Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and Voice Anonymization
Natalia Tomashenko, Emmanuel Vincent, Marc Tommasi
TL;DR
This work shows that speaker identity can be inferred from speech temporal dynamics, notably phoneme durations, even in anonymized speech. It introduces two metrics, $\rho_1$ and $\rho_2$, that perform automatic speaker verification (ASV) from phoneme-duration patterns, and evaluates them on LibriSpeech with two anonymization systems (SAS-1 and SAS-2). The results show that phoneme timing leaks speaker information in both original and anonymized speech, that $\rho_2$ is generally more effective than $\rho_1$, and that increasing phoneme-set granularity does not necessarily improve performance. Normalizing speech rate can have a modest effect on ASV outcomes. SAS-2 offers stronger privacy by altering phoneme durations, but residual speaker cues remain at higher utterance counts. The findings underscore the importance of accounting for temporal dynamics in voice-anonymization design and point to future ML-based analyses that could enhance privacy through temporal normalization and attention-based approaches.
Abstract
In this paper, we investigate the role of speech temporal dynamics in automatic speaker verification and voice anonymization. We propose several metrics to perform automatic speaker verification based only on phoneme durations. Experimental results demonstrate that phoneme durations leak speaker information and can reveal speaker identity from both original and anonymized speech. This work therefore emphasizes the importance of taking into account the speaker's speech rate and, more importantly, the speaker's phonetic duration characteristics, as well as the need to modify them in order to develop anonymization systems that provide strong privacy protection.
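To make the idea of duration-only verification concrete, the following is a minimal sketch of one way such a score could be computed. It is not the paper's $\rho_1$ or $\rho_2$, whose definitions are not given in this section. It assumes each utterance is already available as a list of (phoneme, duration) pairs from a forced aligner, and it uses the Pearson correlation of per-phoneme mean durations as a stand-in similarity. The function names `mean_durations` and `duration_similarity` are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def mean_durations(segments):
    """Map each phoneme label to its mean duration in seconds.

    `segments` is an iterable of (phoneme, duration) pairs, assumed to
    come from a forced alignment of one or more utterances.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for phoneme, duration in segments:
        totals[phoneme] += duration
        counts[phoneme] += 1
    return {p: totals[p] / counts[p] for p in totals}


def duration_similarity(enroll_segments, test_segments):
    """Pearson correlation of mean phoneme durations over shared phonemes.

    Illustrative stand-in score only (an assumption of this sketch),
    not the rho_1 or rho_2 metric proposed in the paper.
    """
    enroll = mean_durations(enroll_segments)
    test = mean_durations(test_segments)
    shared = sorted(set(enroll) & set(test))
    if len(shared) < 2:
        return 0.0  # too few shared phonemes to correlate

    xs = [enroll[p] for p in shared]
    ys = [test[p] for p in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # constant duration profile, correlation undefined
    return cov / sqrt(var_x * var_y)


# Toy alignments with made-up durations, for illustration only.
enroll = [("AH", 0.08), ("S", 0.12), ("IY", 0.15), ("AH", 0.10)]
similar = [("AH", 0.09), ("S", 0.12), ("IY", 0.15)]
different = [("AH", 0.15), ("S", 0.12), ("IY", 0.09)]

score_similar = duration_similarity(enroll, similar)
score_different = duration_similarity(enroll, different)
```

In a verification setting, a score like this would be compared against a threshold to accept or reject the claimed identity. The durations above are invented toy values and say nothing about real speakers.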
