TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization

Waris Quamer, Mu-Ruei Tseng, Ghady Nasrallah, Ricardo Gutierrez-Osuna

TL;DR

TVTSyn targets privacy-preserving streaming voice conversion (VC) and speaker anonymization (SA) by matching the temporal granularity of timbre to that of content. It introduces a Global Timbre Memory and a time-varying timbre path with gating and spherical interpolation, plus a factorized vector-quantized (VQ) bottleneck that reduces residual speaker leakage, enabling fully causal synthesis with under 80 ms of GPU latency. On VC and SA tasks evaluated under the VoicePrivacy Challenge, TVTSyn reports better privacy–utility trade-offs than state-of-the-art streaming baselines, as measured by equal error rate (EER), word error rate (WER), mean opinion score (MOS), and identity-preservation metrics. The work offers a scalable framework for real-time, privacy-preserving, and expressive speech synthesis, and leaves controllable anonymization and cross-lingual robustness as open directions.

Abstract

Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems have a core representational mismatch: content is time-varying, while speaker identity is injected as a static global embedding. We introduce a streamable speech synthesizer that aligns the temporal granularity of identity and content via a content-synchronous, time-varying timbre (TVT) representation. A Global Timbre Memory expands a global timbre instance into multiple compact facets; frame-level content attends to this memory, a gate regulates variation, and spherical interpolation preserves identity geometry while enabling smooth local changes. In addition, a factorized vector-quantized bottleneck regularizes content to reduce residual speaker leakage. The resulting system is streamable end-to-end, with <80 ms GPU latency. Experiments show improvements in naturalness, speaker transfer, and anonymization compared to SOTA streaming baselines, establishing TVT as a scalable approach for privacy-preserving and expressive speech synthesis under strict latency budgets.
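
The time-varying timbre path described above (content attends to a timbre memory, a gate regulates variation, and spherical interpolation keeps the result on the identity sphere) can be sketched for a single frame as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`tvt_frame`, `slerp`), the scaled dot-product attention, the sigmoid gate with parameters `gate_w` and `gate_b`, and the toy dimensions are all hypothetical, and plain Python is used only to keep the example dependency-free.

```python
import math

def normalize(v):
    """Scale v to unit length (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def slerp(a, b, t, eps=1e-6):
    """Spherical interpolation from unit vector a (t=0) to unit vector b (t=1)."""
    a, b = normalize(a), normalize(b)
    cos = max(-1.0, min(1.0, dot(a, b)))
    omega = math.acos(cos)
    s = math.sin(omega)
    if s < eps:
        # The sine of the angle is ~0: fall back to linear weights.
        wa, wb = 1.0 - t, t
    else:
        wa = math.sin((1.0 - t) * omega) / s
        wb = math.sin(t * omega) / s
    return normalize([wa * x + wb * y for x, y in zip(a, b)])

def tvt_frame(content, global_timbre, mem_keys, mem_values, gate_w, gate_b):
    """One frame of a hypothetical time-varying timbre path.

    content:        frame-level content vector (length D)
    global_timbre:  utterance-level timbre vector (length D)
    mem_keys/values: K memory facets, each of length D (assumed layout)
    gate_w, gate_b: parameters of an assumed sigmoid gate on the content
    """
    scale = math.sqrt(len(content))
    # Content queries the memory facets (assumed scaled dot-product attention).
    attn = softmax([dot(content, k) / scale for k in mem_keys])
    # Attention-weighted mix of facet values gives a local timbre proposal.
    local = [sum(w * v[i] for w, v in zip(attn, mem_values))
             for i in range(len(global_timbre))]
    # The gate in (0, 1) controls how far to move away from the global timbre.
    gate = 1.0 / (1.0 + math.exp(-(dot(content, gate_w) + gate_b)))
    return slerp(global_timbre, local, gate), attn, gate

# Toy usage with made-up numbers (D = 3, K = 2).
content = [0.2, -0.1, 0.4]
g = [1.0, 0.0, 0.0]
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
values = [[0.9, 0.1, 0.0], [0.7, 0.0, 0.3]]
timbre, attn, gate = tvt_frame(content, g, keys, values, [0.5, 0.5, 0.5], 0.0)
print(timbre, attn, gate)
```

In this sketch a gate near 0 returns the global timbre unchanged and a gate near 1 returns the memory-derived local timbre. Interpolating along the sphere rather than linearly keeps every intermediate vector at unit norm, which is one plausible reading of "preserves identity geometry".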

Paper Structure

This paper contains 32 sections, 3 equations, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: (a) The content encoder in TVTSyn is trained separately with supervision from an off-line HuBERT model. (b) The waveform decoder is trained in a self-supervised fashion to reconstruct the input utterance from content and speaker embedding streams. Dashed lines are disabled at inference.
  • Figure 2: Architecture details for (a) TVT processing block, (b) waveform decoder.
  • Figure 3: t-SNE visualization of content embeddings, color-coded by speaker. Markers denote native ($\blacklozenge$) or non-native ($\circ$). (a) Continuous embeddings, (b) logits, (c) bottleneck, and (d) VQ bottleneck.
  • Figure 4: Qualitative analysis of time-varying timbre for the text: "Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob." (a) Content-GTM attention map and (b) Top-1 strip show content-dependent selection of timbre facets; (c) PCA trajectories (pre-slerp vs. final); (d) PCA projection of GTM value tokens (size $\propto$ usage) and (e) token-usage histogram indicate diverse, non-collapsed facets.
  • Figure 5: Objective evaluation results for voice conversion. Src-SIM: cosine similarity between the converted speech and the source speaker; Trg-SIM: cosine similarity between the converted speech and the target speaker; NISQA-MOS: Speech Quality and Naturalness Assessment. Src-SIM and Trg-SIM for source speech (i.e., unaltered) reflect within- and between-speaker similarity, respectively.
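
The Src-SIM and Trg-SIM scores in Figure 5 are cosine similarities between speaker embeddings. A minimal sketch of that computation is shown below, assuming the embeddings have already been extracted by some speaker-verification model (the source does not say which one is used here). The helper names `cosine_similarity` and `speaker_similarity` and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def speaker_similarity(converted_emb, source_emb, target_emb):
    """Return Src-SIM and Trg-SIM for one converted utterance.

    A lower src_sim suggests less of the source identity remains;
    a higher trg_sim suggests a closer match to the target speaker.
    """
    return {
        "src_sim": cosine_similarity(converted_emb, source_emb),
        "trg_sim": cosine_similarity(converted_emb, target_emb),
    }

# Toy usage with made-up 3-dimensional embeddings.
scores = speaker_similarity(
    converted_emb=[0.1, 0.9, 0.2],
    source_emb=[0.8, 0.1, 0.1],
    target_emb=[0.1, 0.8, 0.3],
)
print(scores)
```

In practice these scores would be averaged over many utterance pairs. Computing them on unaltered source speech gives the within- and between-speaker reference levels mentioned in the caption.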