VibeVoice Technical Report

Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei

TL;DR

VibeVoice tackles scalable synthesis of long-form, multi-speaker conversational speech. It combines ultra-low-frame-rate acoustic and semantic tokenizers with a next-token diffusion framework guided by an LLM, enabling up to 90 minutes of speech from up to four speakers. The system achieves state-of-the-art results on long-form benchmarks and generalizes to short utterances, aided by curriculum learning and CFG-based diffusion. While the approach is promising for research and for applications such as podcasts, the report notes limitations in language coverage, no handling of overlapping speech, and safety risks around deepfakes.

Abstract

This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.
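
The link between the 90-minute limit and the 64K context window can be checked with simple arithmetic. The sketch below is illustrative only and not taken from the report. It assumes that "64K" means 65,536 positions and that the entire window is spent on speech latents, whereas in practice the text script and voice prompts also occupy positions. The function names and the example frame rates are hypothetical.

```python
import math

# Assumption: "64K" is read as 64 * 1024 positions; the report does not spell this out.
CONTEXT_TOKENS = 64 * 1024
MAX_MINUTES = 90  # from the abstract


def max_frames_per_second(context_tokens: int, minutes: float) -> float:
    """Upper bound on the tokenizer frame rate if the whole window held speech latents."""
    return context_tokens / (minutes * 60)


def tokens_needed(frame_rate_hz: float, minutes: float) -> int:
    """Number of latent positions needed for `minutes` of audio at a given frame rate."""
    return math.ceil(frame_rate_hz * minutes * 60)


bound = max_frames_per_second(CONTEXT_TOKENS, MAX_MINUTES)
print(f"frame-rate upper bound: {bound:.2f} latents per second")

# Hypothetical frame rates, chosen only to show the scale of the constraint.
for rate in (10, 75):
    need = tokens_needed(rate, MAX_MINUTES)
    print(f"{rate} Hz -> {need} positions, fits in window: {need <= CONTEXT_TOKENS}")
```

Under these assumptions the tokenizer can emit at most about 12 latents per second of audio, which is why a very low frame rate is a precondition for 90-minute synthesis in a single context window.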

Paper Structure

This paper contains 9 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: VibeVoice is capable of synthesizing 5,000+ seconds of audio while consistently outperforming strong open/closed-source systems in subjective evaluations of preference, realism, and richness.
  • Figure 2: VibeVoice employs a next-token diffusion framework, as in LatentLM, to synthesize long-form, multi-speaker audio. Voice prompts and text scripts provide the initial input. VibeVoice processes hybrid context features, and its hidden states condition a token-level Diffusion Head (D), which predicts acoustic VAE features for speech segments; these are subsequently decoded to audio by the acoustic decoder (A).