Quantifying Source Speaker Leakage in One-to-One Voice Conversion
Scott Wellington, Xuechen Liu, Junichi Yamagishi
TL;DR
The paper addresses the risk that one-to-one voice conversion (VC) leaks source-speaker identity. It introduces a white-box evaluation framework built on a self-supervised learning (SSL) based VC pipeline (HuBERT for content, ECAPA for speaker embeddings) with a HiFi-GAN vocoder, and quantifies leakage with the Earth Mover's Distance ($EMD$) between cosine-similarity distributions among the source ($D$), target ($P$), and converted ($P'$) utterances. Leakage is formalized as $L = \frac{EMD(B,G)}{EMD(R,G)}$ with $B=\cos(P,D)$, $R=\cos(P',D)$, and $G=\cos(P',P)$, where a higher $L$ indicates that the source speaker is more susceptible to identification. Analyses span the SPEECON and VCTK datasets across gender, age, accent, and environment mismatches. The results reveal a sliding scale of information leakage that depends on the mismatch between source and target (notably gender and certain accents). The paper also discusses potential artefacts from model bias, argues for privacy-focused strategies and broader model testing to reduce leakage, highlights practical privacy implications for synthetic-voice services, and calls for future work on robust disentanglement and multi-model evaluations.
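As an illustration of the ratio defined above, the following is a minimal sketch, not the authors' implementation. It assumes that $B$, $R$, and $G$ are already available as plain lists of cosine-similarity scores, and it computes a one-dimensional EMD between two empirical samples by integrating the absolute difference of their empirical CDFs. The function names (`cosine`, `emd_1d`, `leakage`) are hypothetical and chosen only for this example; how the score lists are formed from speaker embeddings (for example, which utterance pairs are compared) is left open here.

```python
import math
from bisect import bisect_right


def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def emd_1d(u, v):
    """1-D Earth Mover's Distance between two empirical samples.

    Computed as the integral of |CDF_u(x) - CDF_v(x)| over x, which is
    the standard closed form for the 1-D case.
    """
    u_sorted, v_sorted = sorted(u), sorted(v)
    points = sorted(u_sorted + v_sorted)
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Empirical CDF values are constant on [left, right).
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


def leakage(b_scores, r_scores, g_scores):
    """Leakage ratio L = EMD(B, G) / EMD(R, G), per the definition above."""
    denominator = emd_1d(r_scores, g_scores)
    if denominator == 0.0:
        raise ValueError("EMD(R, G) is zero; the ratio is undefined")
    return emd_1d(b_scores, g_scores) / denominator


if __name__ == "__main__":
    # Made-up scores purely for illustration; these are not results from the paper.
    B = [0.1, 0.2]  # assumed stand-in for cos(P, D)
    R = [0.3, 0.4]  # assumed stand-in for cos(P', D)
    G = [0.8, 0.9]  # assumed stand-in for cos(P', P)
    print(leakage(B, R, G))
```

With real data, the three score lists would come from speaker embeddings of the source, target, and converted utterances; a library routine such as SciPy's `wasserstein_distance` could replace `emd_1d` if that dependency is acceptable.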
Abstract
Using a multi-accented corpus of parallel utterances intended for use with commercial speech devices, we present a case study showing that it is possible to quantify a degree of confidence about a source speaker's identity in one-to-one voice conversion. Following voice conversion using a HiFi-GAN vocoder, we compare information leakage for a range of speaker characteristics. Assuming a "worst-case" white-box scenario, we quantify our confidence in performing inference to narrow the pool of likely source speakers. This reinforces the regulatory obligation and moral duty that providers of synthetic voices have to ensure the privacy of their speakers' data.
