FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec
Yurii Halychanskyi, Cameron Churchwell, Yutong Wen, Volodymyr Kindratenko
TL;DR
FAC-FACodec introduces an explicit, user-controllable degree of accent modification for foreign accent conversion. It learns a native-pronunciation prior over FACodec content latents with a diffusion model; at inference, it adds noise to the content latents and then denoises them under that prior. The framework needs no parallel data: it trains on native speech with transcripts and runs inference on non-native speech with transcripts. It concentrates on pronunciation while preserving suprasegmental cues. Objective and subjective evaluations on L2-Arctic show competitive performance and a smooth continuum of accent strength governed by the control parameter $t_{\text{start}}$, which points to practical use in learning, dubbing, and personal communication. Unlike prior AC methods, the approach lets users tune the trade-off between accent conversion strength and speaker-identity preservation. Future work targets perceptually guided noise and coordinated prosody control.
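The noise-then-denoise mechanism can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the linear noise schedule, the deterministic DDIM-style sampler, the treatment of $t_{\text{start}}$ as a fraction in $[0, 1]$, and above all the "native prior", which here is an analytic Gaussian $\mathcal{N}(\mu, \sigma^2 I)$ standing in for a learned diffusion model over FACodec content latents. The function names `make_alpha_bar`, `toy_native_eps`, and `convert` are hypothetical.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_min=1e-4, beta_max=0.02):
    """Cumulative signal retention for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    return np.cumprod(1.0 - betas)


def toy_native_eps(z_t, a, mu, sigma):
    """Exact E[eps | z_t] when clean latents follow N(mu, sigma^2 I).

    Stand-in for a learned denoising network; not the paper's model.
    """
    return np.sqrt(1.0 - a) * (z_t - np.sqrt(a) * mu) / (a * sigma**2 + 1.0 - a)


def convert(z0, t_start, alpha_bar, mu, sigma, rng):
    """Noise latents z0 up to t_start, then denoise back under the toy prior."""
    if not 0.0 <= t_start <= 1.0:
        raise ValueError("t_start must lie in [0, 1]")
    step = min(int(round(t_start * len(alpha_bar))), len(alpha_bar))
    if step == 0:
        return z0.copy()  # no noise added, so the input is returned unchanged

    # Forward diffusion in one shot: z_t = sqrt(a) z0 + sqrt(1 - a) eps.
    a = alpha_bar[step - 1]
    z = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * rng.standard_normal(z0.shape)

    # Deterministic DDIM-style reverse pass from `step` down to 0.
    for k in range(step, 0, -1):
        a_t = alpha_bar[k - 1]
        a_prev = alpha_bar[k - 2] if k > 1 else 1.0
        eps_hat = toy_native_eps(z, a_t, mu, sigma)
        x0_hat = (z - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
        z = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
    return z


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
mu = np.zeros((64, 8))   # toy centre of the "native" prior
z_accented = mu + 3.0    # toy "non-native" latents, far from the prior
for t in (0.0, 0.3, 0.9):
    out = convert(z_accented, t, alpha_bar, mu, 0.1, rng)
    print(t, np.linalg.norm(out - mu), np.linalg.norm(out - z_accented))
```

In this toy setting, a larger `t_start` discards more of the input before denoising, so the output lands closer to the prior and farther from the source latents, while `t_start = 0` returns the input untouched. That mirrors, in a simplified form, the trade-off between conversion strength and preservation of the original signal.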
Abstract
Previous accent conversion (AC) methods, including those for foreign accent conversion (FAC), lack explicit control over the degree of modification. Because accent modification can alter the perceived speaker identity, balancing conversion strength against identity preservation is crucial. We present an AC framework that provides an explicit, user-controllable parameter for accent modification. The method targets pronunciation while preserving suprasegmental cues such as intonation and phoneme durations. Results show performance comparable to recent AC systems, stronger preservation of speaker identity, and unique support for controllable accent conversion.
