PHONOS: PHOnetic Neutralization for Online Streaming Applications

Waris Quamer, Mu-Ruei Tseng, Ghady Nasrallah, Ricardo Gutierrez-Osuna

Abstract

Speaker anonymization (SA) systems modify timbre while leaving regional or non-native accents intact, which is problematic because accents can narrow the anonymity set. To address this issue, we present PHONOS, a streaming module for real-time SA that neutralizes non-native accents to sound native-like. Our approach pre-generates golden speaker utterances that preserve source timbre and rhythm but replace foreign segmentals with native ones using silence-aware DTW alignment and zero-shot voice conversion. These utterances supervise a causal accent translator that maps non-native content tokens to native equivalents with at most 40 ms of look-ahead, trained using joint cross-entropy and CTC losses. Our evaluations show an 81% reduction in non-native accent confidence, with listening-test ratings consistent with this shift, and reduced speaker linkability as accent-neutralized utterances move away from the original speaker in embedding space, while keeping latency under 241 ms on a single GPU.
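The 40 ms look-ahead bound means the accent translator's attention is causal except for a fixed, small number of future tokens. As an illustrative sketch only (not the authors' implementation), and assuming a hypothetical 20 ms token hop so that 40 ms corresponds to two future tokens, such a bounded look-ahead attention mask could be built as follows:

```python
import numpy as np

def lookahead_mask(seq_len: int, lookahead: int) -> np.ndarray:
    """Boolean attention mask for limited-context (streaming) attention.

    Position i may attend to position j iff j <= i + lookahead, i.e.
    fully causal attention plus a fixed window of future tokens.
    """
    idx = np.arange(seq_len)
    # Broadcast row indices against column indices: allowed[i, j] is True
    # when token j is no more than `lookahead` steps ahead of token i.
    return idx[None, :] <= (idx[:, None] + lookahead)

# Example: 5 tokens, 2-token (~40 ms at a 20 ms hop) look-ahead.
mask = lookahead_mask(5, 2)
```

In a transformer, this mask would be applied to the attention logits (disallowed positions set to -inf before the softmax); the streaming latency is then bounded by `lookahead` times the token hop.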

Paper Structure

This paper contains 16 sections, 4 figures, and 1 table.

Figures (4)

  • Figure 1: PHONOS inference pipeline. Non-native speech is encoded into content tokens, accent-translated to native tokens, and decoded into a waveform conditioned on the original or pseudo-speaker embedding.
  • Figure 2: TVTSyn training workflow. (a) content encoder trained against HuBERT k-means pseudo-labels, and (b) wav decoder conditioned on speaker embedding trained with self-supervision and discriminator objectives.
  • Figure 3: Golden speaker generation. Native and non-native content embeddings are duration-aligned via silence-aware DTW, then synthesized with the non-native speaker's identity.
  • Figure 4: PHONOS's accent translator architecture. Non-native content tokens pass through ConvNeXt and limited-context transformer layers to produce native content tokens.
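The duration alignment behind Figure 3 rests on dynamic time warping (DTW) between native and non-native content-embedding sequences. The sketch below shows only the standard DTW alignment step over Euclidean frame distances; the paper's silence-aware variant (how silence frames are weighted or excluded) is not detailed here, and the helper name `dtw_path` is hypothetical:

```python
import numpy as np

def dtw_path(A: np.ndarray, B: np.ndarray) -> list:
    """Align two embedding sequences A (n, d) and B (m, d) with plain DTW.

    Returns the optimal warping path as a list of (i, j) index pairs.
    """
    n, m = len(A), len(B)
    # Pairwise Euclidean distances between all frames of A and B.
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    # Accumulated-cost matrix with a padded row/column of infinities.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],      # insertion
                                               D[i, j - 1],      # deletion
                                               D[i - 1, j - 1])  # match
    # Backtrack from (n, m) to (0, 0) along the cheapest predecessors.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        vals = [D[a, b] for a, b in steps]
        i, j = steps[int(np.argmin(vals))]
    return path[::-1]
```

In the golden-speaker pipeline described in the abstract, a path like this would be used to warp the native content sequence onto the non-native speaker's timing before zero-shot voice conversion re-synthesizes it with the source speaker's identity.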