Vevo2: A Unified and Controllable Framework for Speech and Singing Voice Generation

Xueyao Zhang, Junan Zhang, Yuancheng Wang, Chaoren Wang, Yuanzhe Chen, Dongya Jia, Zhuo Chen, Zhizheng Wu

TL;DR

Vevo2 tackles controllable voice generation for speech and singing with two unified audio tokenizers and a two-stage architecture: an auto-regressive (AR) stage followed by a flow-matching (FM) stage. It enables flexible control over text, prosody/melody, style, timbre, duration, and pitch through joint speech-singing training with explicit and implicit prosody learning (EPL/IPL) and a multi-objective post-training regime guided by GRPO. The unified approach yields mutual benefits across speech and singing, performs strongly on synthesis, conversion, and editing tasks, and enables novel applications such as humming-to-singing and instrument-to-singing. The framework demonstrates that unified training across expressive domains can improve both data efficiency and controllability for diverse audio generation tasks.

Abstract

Controllable human voice generation, particularly for expressive domains like singing, remains a significant challenge. This paper introduces Vevo2, a unified framework for controllable speech and singing voice generation. To tackle issues like the scarcity of annotated singing data and to enable flexible controllability, Vevo2 introduces two audio tokenizers: (1) a unified music-notation-free prosody tokenizer that captures prosody and melody from speech, singing, and even instrumental sounds, and (2) a unified content-style tokenizer that encodes linguistic content, prosody, and style for both speech and singing, while enabling timbre disentanglement. Vevo2 consists of an auto-regressive (AR) content-style modeling stage, which aims to enable controllability over text, prosody, and style, as well as a flow-matching acoustic modeling stage that allows for timbre control. In particular, during the speech-singing joint training of the AR model, we propose both explicit and implicit prosody learning strategies to bridge speech and singing voice. Moreover, to further enhance Vevo2's ability to follow text and prosody, we design a multi-objective post-training task that integrates both intelligibility and prosody similarity alignment. Experimental results show that the unified modeling in Vevo2 brings mutual benefits to both speech and singing voice generation. Additionally, Vevo2's effectiveness across a wide range of synthesis, conversion, and editing tasks for both speech and singing further demonstrates its strong generalization ability and versatility. Audio samples are available at https://versasinger.github.io/.
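
As a rough illustration of the multi-objective post-training idea, the sketch below combines an intelligibility score and a prosody-similarity score into one reward and then normalizes rewards within a group of sampled outputs, which is the group-relative step that GRPO is generally known for. The function names, the weighted-sum combination, the equal default weights, and the use of the population standard deviation are all assumptions made for illustration; the text above does not specify how Vevo2 combines or scales its rewards.

```python
from statistics import mean, pstdev


def combined_reward(intelligibility, prosody_similarity,
                    w_intell=0.5, w_prosody=0.5):
    """Hypothetical scalar reward: a weighted sum of the two objectives.

    The weights and the weighted-sum form are illustrative assumptions,
    not values or formulas taken from the paper.
    """
    return w_intell * intelligibility + w_prosody * prosody_similarity


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that beat the group average get a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical candidates generated for the same text and prosody prompt,
# each scored as (intelligibility, prosody_similarity) in [0, 1].
scores = [(0.9, 0.4), (0.6, 0.8), (0.3, 0.2)]
rewards = [combined_reward(i, p) for i, p in scores]
advantages = group_relative_advantages(rewards)
```

In this toy group, the second candidate has the highest combined reward and therefore the largest advantage, even though the first candidate is more intelligible; changing the hypothetical weights shifts that trade-off.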

Paper Structure

This paper contains 36 sections, 7 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Vevo2 inference pipeline for versatile synthesis, conversion, and editing tasks.
  • Figure 2:
  • Figure 3: Speech-Singing Joint Training with Explicit Prosody Learning (EPL) and Implicit Prosody Learning (IPL). We perform the next token prediction only on the sequence of content-style tokens (see Section \ref{sec:pre-training} for more details).
  • Figure 4: Multi-objective alignment for both intelligibility and prosody similarity. This figure demonstrates how we utilize instrumental music as prosody prompts during post-training of Vevo2.
  • Figure 5: Effect of intelligibility reward (w/ Intell) and prosody similarity reward (w/ Prosody) for post-training. The right panel presents subjective evaluation results on the instrument-to-singing task.