YuE: Scaling Open Foundation Models for Long-Form Music Generation

Ruibin Yuan, Hanfeng Lin, Shuyue Guo, Ge Zhang, Jiahao Pan, Yongyi Zang, Haohe Liu, Yiming Liang, Wenye Ma, Xingjian Du, Xinrun Du, Zhen Ye, Tianyu Zheng, Zhengxuan Jiang, Yinghao Ma, Minghao Liu, Zeyue Tian, Ziya Zhou, Liumeng Xue, Xingwei Qu, Yizhi Li, Shangda Wu, Tianhao Shen, Ziyang Ma, Jun Zhan, Chunhui Wang, Yatian Wang, Xiaowei Chi, Xinyue Zhang, Zhenzhu Yang, Xiangzhou Wang, Shansong Liu, Lingrui Mei, Peng Li, Junjie Wang, Jianwei Yu, Guojian Pang, Xu Li, Zihao Wang, Xiaohuan Zhou, Lijun Yu, Emmanouil Benetos, Yong Chen, Chenghua Lin, Xie Chen, Gus Xia, Zhaoxiang Zhang, Chao Zhang, Wenhu Chen, Xinyu Zhou, Xipeng Qiu, Roger Dannenberg, Jiaheng Liu, Jian Yang, Wenhao Huang, Wei Xue, Xu Tan, Yike Guo

TL;DR

YuE presents an open-source, two-stage foundation-model approach for long-form lyrics-to-song generation, combining track-decoupled next-token prediction (NTP), structural segment conditioning, and music-tailored in-context learning to produce coherent songs of up to five minutes with expressive vocals. A multitask, multiphase pre-training regime, along with a semantic-acoustic fused audio codec and a residual Stage-2 model, enables scalable training and high-quality audio reconstruction. Empirical results show YuE to be competitive with proprietary systems on musicality, controllability, and multilingual lyrics-following, while extending its capabilities to representation learning and music understanding. The work also provides detailed ablations, multilingual fine-tuning, and analysis of memorization, tokenizers, and test-time strategies, highlighting both the promise and limitations of open, large-scale music foundation models.

Abstract

We tackle the task of long-form music generation, particularly the challenging lyrics-to-song problem, by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some proprietary systems in musicality and vocal agility. Fine-tuning YuE also enables additional controls and enhanced support for tail languages. Beyond generation, we show that YuE's learned representations perform well on music understanding tasks, where YuE matches or exceeds state-of-the-art methods on the MARBLE benchmark.

Keywords: lyrics2song, song generation, long-form, foundation model, music generation
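The abstract names track-decoupled next-token prediction without giving the token layout. As a minimal sketch, assuming the vocal and accompaniment tracks are tokenized separately and interleaved frame by frame into one sequence for an ordinary next-token predictor, the idea could look like the following. The function names and the integer token IDs are hypothetical and are not taken from the YuE code base.

```python
from typing import List, Sequence, Tuple


def interleave_tracks(vocal: Sequence[int], accomp: Sequence[int]) -> List[int]:
    """Merge two frame-aligned token streams into one sequence.

    Assumed layout (illustrative only): v_1, a_1, v_2, a_2, ...
    so that a standard next-token predictor sees both tracks
    instead of a single dense mixture token per frame.
    """
    if len(vocal) != len(accomp):
        raise ValueError("tracks must have the same number of frames")
    merged: List[int] = []
    for v, a in zip(vocal, accomp):
        merged.extend((v, a))
    return merged


def split_tracks(merged: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Invert interleave_tracks: even positions are vocal, odd are accompaniment."""
    if len(merged) % 2 != 0:
        raise ValueError("interleaved sequence must have even length")
    return list(merged[0::2]), list(merged[1::2])


# Hypothetical token IDs for three frames of each track.
vocal_tokens = [1, 2, 3]
accomp_tokens = [10, 20, 30]
sequence = interleave_tracks(vocal_tokens, accomp_tokens)
recovered_vocal, recovered_accomp = split_tracks(sequence)
```

Under this assumed layout, the two tracks can be recovered after generation by de-interleaving, which is what `split_tracks` illustrates.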

Paper Structure

This paper contains 80 sections, 19 equations, 17 figures, 10 tables.

Figures (17)

  • Figure 1: The General Application of YuE. The YuE model takes text (the meta information and lyrics of the song to be generated) and arbitrary audio as conditions. The model can control outputs along multiple dimensions, such as genre, emotion, and language.
  • Figure 2: Overview of YuE framework: two-stage lyrics-to-song generation with audio/text tokenizers and two language models. Stage-1: music language modeling. Stage-2: residual modeling. Blue: vocal tokens. Orange: accompaniment tokens. Grey: residual tokens.
  • Figure 3: $\Delta\text{WER}$ across different music genres for mixture / vocal-only tracks. $\Delta\text{WER}\propto\text{LLAT}$.
  • Figure 4: The Stage-1 Framework of YuE. Dotted lines: Dual-NTP (see the Dual-NTP section). Text interleave: CoT (see the CoT section). Green tokens: ICL (see the ICL section). Multitask learning: see the multitask, multiphase pre-training section.
  • Figure 5: Stage-2 Framework of YuE. $S$=<SOA>, $E$=<EOA>, $S_i$=<stage_i>.
  • ...and 12 more figures