LeVo: High-Quality Song Generation with Multi-Preference Alignment

Shun Lei, Yaoxun Xu, Zhiwei Lin, Huaicheng Zhang, Wei Tan, Hangting Chen, Jianwei Yu, Yixuan Zhang, Chenyu Yang, Haina Zhu, Shuai Wang, Zhiyong Wu, Dong Yu

TL;DR

LeVo is a song-generation framework built from LeLM and Music Codec. LeLM models two token types in parallel: mixed tokens, which capture vocal-instrument harmony, and dual-track tokens, which encode vocals and accompaniment separately for higher fidelity. A three-stage training regimen, followed by multi-preference alignment via Direct Preference Optimization (DPO), improves lyric alignment, prompt consistency, and musicality, addressing data-quality and instruction-following challenges. Empirically, LeVo surpasses open-source baselines and approaches industry systems on objective metrics, and it achieves superior subjective quality on several dimensions; ablations support the design choices. The work advances controllable, high-quality long-form song generation with flexible conditioning and has potential use in real-world music production tools, though it remains limited by data quality and still trails the top proprietary systems.

Abstract

Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in audio quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, a language-model-based framework consisting of LeLM and Music Codec. LeLM models two types of tokens in parallel: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve better vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types. To further enhance musicality and instruction-following ability, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and post-training. Experimental results demonstrate that LeVo significantly outperforms existing open-source methods in both objective and subjective metrics, while performing competitively with industry systems. Ablation studies further justify the effectiveness of our designs. Audio examples and source code are available at https://levo-demo.github.io and https://github.com/tencent-ailab/songgeneration.
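
For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO objective for a single preference pair, written in plain Python. It is not LeVo's implementation. The function names, the example log-probabilities, the value of `beta`, and the idea of simply averaging pairs drawn from several preference dimensions are all assumptions made for illustration; the abstract does not specify how the different preferences are combined.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    chosen_margin = policy_chosen - ref_chosen        # how much the policy upweights the chosen sample
    rejected_margin = policy_rejected - ref_rejected  # how much it upweights the rejected sample
    # -log(sigmoid(z)) equals softplus(-z)
    return softplus(-beta * (chosen_margin - rejected_margin))


def mean_dpo_loss(pairs, beta=0.1):
    """Average the loss over pairs pooled from several preference dimensions.

    Pooling by a plain mean is an illustrative assumption, not the paper's method.
    """
    losses = [dpo_loss(*pair["logps"], beta=beta) for pair in pairs]
    return sum(losses) / len(losses)


# Hypothetical pairs; the log-probabilities are made-up numbers.
pairs = [
    {"dimension": "lyric_alignment", "logps": (-10.0, -14.0, -11.0, -12.0)},
    {"dimension": "prompt_consistency", "logps": (-9.0, -9.5, -9.0, -9.0)},
    {"dimension": "musicality", "logps": (-20.0, -21.0, -19.0, -22.0)},
]
print(mean_dpo_loss(pairs))
```

When the policy assigns the same log-probabilities as the reference, the loss equals log 2; it falls below that value as the policy shifts probability toward the chosen sample relative to the rejected one.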

Paper Structure

This paper contains 53 sections, 2 equations, 9 figures, and 11 tables.

Figures (9)

  • Figure 1: Overview of LeVo, a song generation framework conditioned on lyrics, optional text descriptions, and optional audio prompts. It consists of LeLM and Music Codec.
  • Figure 2: Architecture of LeLM, which consists of a language model and an AR decoder.
  • Figure 3: Framework of the Music Codec in LeVo.
  • Figure 4: Screenshot of the MOS test for overall quality.
  • Figure 5: Screenshot of the MOS test for vocal melodic attractiveness.
  • ...and 4 more figures