Quantifying Source Speaker Leakage in One-to-One Voice Conversion

Scott Wellington, Xuechen Liu, Junichi Yamagishi

TL;DR

The paper addresses the risk that one-to-one voice conversion leaks source-speaker identity. It introduces a white-box evaluation framework using an SSL-based VC pipeline (HuBERT for content, ECAPA for speaker embeddings) and a HiFi-GAN vocoder, then quantifies leakage with the Earth Mover's Distance ($\mathrm{EMD}$) between distributions of cosine similarities computed among the source ($D$), target ($P$), and converted ($P'$) utterances. Leakage is formalized as $L = \frac{\mathrm{EMD}(B,G)}{\mathrm{EMD}(R,G)}$ with $B=\cos(P,D)$, $R=\cos(P',D)$, and $G=\cos(P',P)$, where a higher $L$ indicates that the source speaker is more susceptible to identification. The analyses cover the SPEECON and VCTK datasets across gender, age, accent, and environment mismatches. The results reveal a sliding scale of information leakage influenced by these mismatches (notably gender and certain accents). The paper also discusses potential artefacts from model bias and argues for privacy-focused strategies and broader model testing to reduce leakage. The study highlights practical privacy implications for synthetic-voice services and calls for future work on robust disentanglement and multi-model evaluations.
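As a rough illustration of the leakage ratio defined above, the following minimal sketch computes $L$ from three lists of cosine similarities. It is not the authors' code: the function names `emd_1d` and `leakage` are hypothetical, the scores are made-up toy numbers, and the EMD is implemented here as the area between two empirical CDFs, which is one standard way to compute it for one-dimensional samples.

```python
import math


def emd_1d(xs, ys):
    """1-D Earth Mover's Distance between two samples.

    Computed as the area between the two empirical CDFs.
    (Hypothetical helper, written for illustration only.)
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    i = j = 0
    for left, right in zip(points, points[1:]):
        # Advance past every sample value <= left to get CDF heights.
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        total += abs(i / len(xs) - j / len(ys)) * (right - left)
    return total


def leakage(B, R, G):
    """L = EMD(B, G) / EMD(R, G), following the formula in the TL;DR."""
    denom = emd_1d(R, G)
    # If R coincides with G the ratio is unbounded; report infinity.
    return math.inf if denom == 0 else emd_1d(B, G) / denom


# Toy similarity scores (assumed values, not results from the paper).
B = [0.1, 0.2, 0.3]  # cos(P, D): original target vs. source
G = [0.7, 0.8, 0.9]  # cos(P', P): converted vs. target
R = [0.4, 0.5, 0.6]  # cos(P', D): converted vs. source

print(round(leakage(B, R, G), 6))
```

Reading the formula directly: when $R$ coincides with $B$, the ratio is exactly 1, and it grows as $R$ moves towards $G$.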

Abstract

Using a multi-accented corpus of parallel utterances for use with commercial speech devices, we present a case study to show that it is possible to quantify a degree of confidence about a source speaker's identity in the case of one-to-one voice conversion. Following voice conversion using a HiFi-GAN vocoder, we compare information leakage for a range of speaker characteristics; assuming a "worst-case" white-box scenario, we quantify our confidence to perform inference and narrow the pool of likely source speakers, reinforcing the regulatory obligation and moral duty that providers of synthetic voices have to ensure the privacy of their speakers' data.

Paper Structure

This paper contains 5 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: An illustration of the pipeline used in this research. Audio from source speaker $D$ and target speaker $P$ is disentangled into speaker embeddings, $F_0$ and speech content for each. The speaker embeddings of source speaker $D$ are replaced with those from target speaker $P$. Remaining disentangled representations are discarded. A HiFi-GAN trained on the LibriSpeech corpus is used to produce the voice-converted (VC) speech $P^\prime$. Cosine similarities are computed between all utterances of $P$, $D$ and $P^\prime$, forming the basis of our calculations for Earth Mover's Distance (EMD) and distributional similarities.
  • Figure 2: Three example scenarios, evaluated through our inference framework. Scenario 1 (top): it is not possible to confidently infer source speaker characteristics from the evidence distribution; information leakage is present, but more data are required to meet the confidence threshold. Scenario 2 (middle): no source speaker characteristics can be inferred; there is no interpretable data leakage. Scenario 3 (bottom): source speaker characteristics can be inferred; there is information leakage.
  • Figure 3: Six distributions from SPEECON, with the cosine similarities between the proximal target $P$, proximal source $D$, and the voice-converted speech $P^\prime$ (replacing the speaker embeddings of $D$ with those of $P$, as in Figure 1) binned into 50 fixed-width intervals. The BLACK distributions are $\cos(P, D)$; the RED distributions are $\cos(P^\prime, D)$; and the GREEN distributions are $\cos(P^\prime, P)$. Plots show the resulting distributional shifts following changes to speaker characteristic variables. Read left-to-right and top-to-bottom, we see RED increasingly move from GREEN towards BLACK: a 'sliding scale' of how the choice of source speaker characteristics results in greater (or lesser) interpretable data leakage.
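
The cosine-similarity distributions described in the Figure 1 and Figure 3 captions can be sketched as follows: compute the cosine similarity for every pair of utterance embeddings from two sets, then bin the scores into 50 fixed-width intervals. This is a minimal sketch under stated assumptions, not the authors' implementation. The helper names `pairwise_cosine` and `histogram` are hypothetical, the embeddings are tiny made-up vectors, and the binning range of $[-1, 1]$ is an assumption (the captions state only the number of bins).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine(set_a, set_b):
    """Cosine similarity for every (a, b) pair across two utterance sets."""
    return [cosine(a, b) for a in set_a for b in set_b]


def histogram(values, n_bins=50, lo=-1.0, hi=1.0):
    """Count values in n_bins fixed-width bins over [lo, hi].

    The range [-1, 1] is an assumption; values at or beyond the
    edges are clipped into the first or last bin.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = int((v - lo) / width)
        counts[max(0, min(idx, n_bins - 1))] += 1
    return counts


# Toy 3-dimensional "embeddings" (assumed values for illustration only).
P = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]        # target utterances
D = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]        # source utterances
P_conv = [[0.8, 0.3, 0.0], [0.7, 0.4, 0.1]]   # voice-converted utterances

black = histogram(pairwise_cosine(P, D))       # cos(P, D)
red = histogram(pairwise_cosine(P_conv, D))    # cos(P', D)
green = histogram(pairwise_cosine(P_conv, P))  # cos(P', P)

print(sum(black), sum(red), sum(green))
```

Each histogram holds one count per utterance pair, so the three totals printed at the end equal the number of pairs in each comparison.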