VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing

Zhisheng Zheng, Puyuan Peng, Anuj Diwan, Cong Phuoc Huynh, Xiaohang Sun, Zhu Liu, Vimal Bhat, David Harwath

TL;DR

VoiceCraft-X addresses the lack of a unified multilingual system that can both edit and synthesize speech by treating editing and zero-shot TTS as a single sequence-generation problem over neural codec tokens. It leverages the Qwen3 LLM for cross-lingual text processing, employs a novel time-aligned token reordering, and uses EnCodec with four codebooks ($K=4$) to tokenize speech for autoregressive generation. The model demonstrates robust performance across 11 languages, including low-resource ones, with strong transfer learning benefits and effective multilingual editing capabilities, all while maintaining high naturalness and intelligibility in both editing and synthesis tasks. This approach promises practical impact for multilingual voice assistants and media workflows, while also highlighting ethical considerations and a responsible release plan. Overall, VoiceCraft-X provides a compelling direction for unified, data-efficient multilingual speech generation in real-world applications.

Abstract

We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.

Paper Structure

This paper contains 41 sections, 3 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Architecture Overview. This diagram illustrates the training process for the VoiceCraft-X model. The model takes text and a speaker embedding as input and is trained to predict sequences of speech tokens. The labels CB1-CB4 represent codec tokens from different codebooks.
  • Figure 2: Illustration of Token Reordering
  • Figure 3: Relationship between per-language fine-tuning data and zero-shot TTS quality. Each point represents a target language, positioned by the number of hours used to fine-tune VoiceCraft-X (x-axis) and the relative Word Error Rate (y-axis), defined as the difference between Whisper's WER on synthesized audio and its WER on ground-truth audio.
  • Figure 4: SMOS Annotation UI
  • Figure 5: CMOS Annotation UI
  • ...and 2 more figures
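
The relative Word Error Rate described in the Figure 3 caption can be illustrated with a minimal Python sketch. It assumes that WER is the word-level edit distance divided by the reference length, computed per utterance on whitespace-split text. The function names `word_error_rate` and `relative_wer` are hypothetical, and the paper's actual text normalization and aggregation across utterances may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer(reference: str, asr_on_synthesized: str, asr_on_ground_truth: str) -> float:
    """WER of the ASR transcript of synthesized audio minus the WER of the
    ASR transcript of ground-truth audio (assumed per-utterance definition)."""
    return (word_error_rate(reference, asr_on_synthesized)
            - word_error_rate(reference, asr_on_ground_truth))


if __name__ == "__main__":
    # Illustrative strings only; these are not results from the paper.
    target_text = "the cat sat on the mat"
    print(relative_wer(target_text, "the cat sat on a mat", "the cat sat on the mat"))
```

Under this reading, a relative WER near zero means the ASR system transcribes the synthesized audio about as accurately as it transcribes real recordings of the same text.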