VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing

Zhisheng Zheng; Puyuan Peng; Anuj Diwan; Cong Phuoc Huynh; Xiaohang Sun; Zhu Liu; Vimal Bhat; David Harwath

TL;DR

VoiceCraft-X is a single multilingual model for both speech editing and zero-shot TTS, treating the two tasks as one sequence-generation problem over neural codec tokens. It uses the Qwen3 LLM for cross-lingual text processing, a novel token reordering with time-aligned text and speech tokens, and EnCodec with four codebooks ($K=4$) to tokenize speech for autoregressive generation. The model performs robustly across 11 languages, including low-resource ones, benefits from transfer learning, and maintains naturalness and intelligibility in both editing and synthesis. Potential applications include multilingual voice assistants and media workflows; the paper also discusses ethical considerations and a responsible release plan.

Abstract

We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.
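The abstract does not spell out the token reordering, so the following is only a minimal sketch of the general idea, under stated assumptions: if the span to be generated is moved to the end of the sequence, both editing (infilling a middle span) and zero-shot TTS (continuing after a voice prompt) reduce to left-to-right generation. The function names, the integer token representation, and the `sep` marker are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def reorder_for_infilling(
    tokens: Sequence[int], start: int, end: int, sep: int = -1
) -> Tuple[List[int], List[int]]:
    """Return (context, target) so that the span [start, end) is generated last.

    Hypothetical helper: `sep` is an assumed boundary marker that must not
    occur among the real tokens.
    """
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span out of range")
    prefix = list(tokens[:start])
    middle = list(tokens[start:end])
    suffix = list(tokens[end:])
    # Context is prefix and suffix, each closed by a separator; the model
    # would then generate `middle` autoregressively after the context.
    context = prefix + [sep] + suffix + [sep]
    return context, middle


def restore_order(
    context: Sequence[int], generated: Sequence[int], sep: int = -1
) -> List[int]:
    """Undo the reordering: put the generated span back between prefix and suffix."""
    context = list(context)
    first = context.index(sep)          # end of the prefix
    prefix = context[:first]
    suffix = context[first + 1 : -1]    # drop the trailing separator
    return prefix + list(generated) + suffix


tokens = [10, 11, 12, 13, 14, 15]

# Editing: regenerate the middle span [2, 4) given both sides as context.
edit_context, edit_target = reorder_for_infilling(tokens, 2, 4)

# Zero-shot TTS: the first three tokens act as the voice prompt, the suffix
# is empty, and everything after the prompt is the generation target.
tts_context, tts_target = reorder_for_infilling(tokens, 3, 6)
```

In this sketch the only difference between the two tasks is where the target span sits, which is what lets one autoregressive model serve both. A real system would apply the same rearrangement to text tokens and to each of the codec codebooks, which this toy example leaves out.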
