Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Ye Bai; Haonan Chen; Jitong Chen; Zhuo Chen; Yi Deng; Xiaohong Dong; Lamtharn Hantrakul; Weituo Hao; Qingqing Huang; Zhongyi Huang; Dongya Jia; Feihu La; Duc Le; Bochen Li; Chumin Li; Hui Li; Xingxing Li; Shouda Liu; Wei-Tsung Lu; Yiqing Lu; Andrew Shaw; Janne Spijkervet; Yakun Sun; Bo Wang; Ju-Chiang Wang; Yuping Wang; Yuxuan Wang; Ling Xu; Yifeng Yang; Chao Yao; Shuo Zhang; Yang Zhang; Yilin Zhang; Hang Zhao; Ziyi Zhao; Dejian Zhong; Shicen Zhou; Pei Zou

TL;DR

Seed-Music tackles the challenge of high-quality vocal music generation with adjustable style control and editing capabilities. It presents a unified framework that combines auto-regressive language models and diffusion, supporting three intermediate representations: audio tokens, lead sheet tokens, and vocoder latents. The paper details three pipelines—audio token-based, symbolic lead-sheet-based, and vocoder latent-based—along with training stages and reinforcement learning to align outputs with prompts, plus diffusion-based editing and zero-shot singing voice conversion. Experiments across Lyrics2Song, Lyrics2Leadsheet2Song, MusicEDiT, and zero-shot VC demonstrate strong control, multi-modal conditioning, and practical editing workflows, with careful attention to ethics and safety.

Abstract

We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: controlled music generation and post-production editing. For controlled music generation, our system enables vocal music generation with performance controls from multi-modal inputs, including style descriptions, audio references, musical scores, and voice prompts. For post-production editing, it offers interactive tools for editing lyrics and vocal melodies directly in the generated audio. We encourage readers to listen to demo audio examples at https://team.doubao.com/seed-music.

Paper Structure (25 sections, 5 figures, 1 table)

This paper contains 25 sections, 5 figures, 1 table.

Introduction
Contributions
Literature Review
Symbolic music based systems.
Audio rendering for symbolic music based systems.
Language model based generative approaches.
Diffusion-based generative models.
Method
Audio Token-based Pipeline
Audio tokenizer.
Generator.
Renderer.
Symbolic Token-based Pipeline
Vocoder Latent-based Pipeline
Model Training and Inference
...and 10 more sections

Figures (5)

Figure 1: An overview of Seed-Music framework.
Figure 2: Overview of the Seed-Music pipeline with audio token as intermediate representation. (1) Input embedders convert multi-modal controlling inputs, such as music style description, lyrics, reference audio, or music scores, into a prefix embedding sequence. (2) The auto-regressive LM generates a sequence of audio tokens. (3) The diffusion transformer model generates continuous vocoder latents. (4) The acoustic vocoder produces high-quality 44.1kHz stereo audio.
Figure 3: Overview of the pipeline using symbolic tokens as the intermediate representation. (1) Conditioned on the user prompt, the auto-regressive LM generates the symbolic tokens corresponding to a lead sheet. (2) The diffusion transformer model generates continuous vocoder latents given the symbolic tokens. (3) The vocoder then generates the high-quality 44.1kHz stereo audio waveform.
Figure 4: Seed-Music pipeline with vocoder latents as intermediate representation. (1) Various input types are fed into DiT via cross-attention, prefix, or temporal conditioning. (2) The diffusion transformer model predicts the continuous vocoder latents. (3) The acoustic vocoder then produces high-quality 44.1kHz stereo audio.
Figure 5: Illustration of the REMI-style symbolic music encoding scheme.
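
As a rough illustration of what a REMI-style encoding (Figure 5) looks like, the sketch below turns a short melody into a flat token sequence. It follows only the general REMI convention of Bar, Position, Pitch, and Duration events; the `Note` class, the 16-steps-per-bar grid, the token names, and the omission of velocity, tempo, and chord tokens are all assumptions made for this example, not the paper's actual vocabulary.

```python
from dataclasses import dataclass
from typing import List

STEPS_PER_BAR = 16  # assumption: 4/4 time on a 16th-note grid


@dataclass
class Note:
    """Hypothetical note record used only for this illustration."""
    start: int     # onset, in 16th-note steps from the start of the piece
    duration: int  # length, in 16th-note steps
    pitch: int     # MIDI pitch number


def encode_remi_style(notes: List[Note]) -> List[str]:
    """Flatten notes into Bar / Position / Pitch / Duration tokens."""
    tokens: List[str] = []
    current_bar = -1
    for note in sorted(notes, key=lambda n: (n.start, n.pitch)):
        bar, position = divmod(note.start, STEPS_PER_BAR)
        # Emit one Bar token per bar boundary crossed, including empty bars.
        while current_bar < bar:
            tokens.append("Bar")
            current_bar += 1
        tokens.append(f"Position_{position}")
        tokens.append(f"Pitch_{note.pitch}")
        tokens.append(f"Duration_{note.duration}")
    return tokens


# Two quarter notes in the first bar, then a half note in the second bar.
melody = [Note(0, 4, 60), Note(4, 4, 62), Note(16, 8, 64)]
print(encode_remi_style(melody))
```

A sequence of this kind is what an auto-regressive LM could predict token by token in a symbolic pipeline such as the one in Figure 3; how the paper's lead sheet tokens represent lyrics, chords, or multiple tracks is not shown here.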
