XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework

Sida Tian, Can Zhang, Wei Yuan, Wei Tan, Wenjie Zhu

TL;DR

XMusic tackles controllable, high-quality symbolic music generation by mapping multi-modal prompts into a symbolic projection space with XProjector, then generating music with a Transformer-based Generator and picking high-quality outputs with a multi-task Selector. It introduces an enhanced symbolic representation and a large-scale dataset, XMIDI, to enable precise emotion and genre control, and reports better objective and subjective results than state-of-the-art methods. The approach supports prompts from images, videos, text, tags, and humming, and demonstrates fine-grained emotion control at the bar level, along with strong humming controllability and cross-modal transfer. By decoupling control-signal parsing from generation, the framework allows new modalities to be integrated flexibly for practical applications.

Abstract

In recent years, artificial intelligence-generated content (AIGC) has advanced remarkably in image synthesis and text generation, producing content comparable to that created by humans. However, the quality of AI-generated music has not yet reached this standard, primarily due to the challenge of effectively controlling musical emotions and ensuring high-quality outputs. This paper presents a generalized symbolic music generation framework, XMusic, which supports flexible prompts (i.e., images, videos, texts, tags, and humming) to generate emotionally controllable and high-quality symbolic music. XMusic consists of two core components, XProjector and XComposer. XProjector parses the prompts of various modalities into symbolic music elements (i.e., emotions, genres, rhythms, and notes) within the projection space to generate matching music. XComposer contains a Generator and a Selector. The Generator generates emotionally controllable and melodious music based on our innovative symbolic music representation, whereas the Selector identifies high-quality symbolic music through a multi-task learning scheme involving quality assessment, emotion recognition, and genre recognition tasks. In addition, we build XMIDI, a large-scale symbolic music dataset that contains 108,023 MIDI files annotated with precise emotion and genre labels. Objective and subjective evaluations show that XMusic significantly outperforms the current state-of-the-art methods with impressive music quality. XMusic was selected as one of the nine Highlights of Collectibles at WAIC 2023. The project homepage of XMusic is https://xmusic-project.github.io.

Paper Structure

This paper contains 43 sections, 12 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: The architectural overview of our XMusic framework. It contains two essential components: XProjector and XComposer. XProjector parses various input prompts into specific symbolic music elements. These elements then serve as control signals, guiding the music generation process within the Generator of XComposer. Additionally, XComposer includes a Selector that evaluates and identifies high-quality generated music. The Generator is trained on our large-scale dataset, XMIDI, which includes precise emotion and genre labels.
  • Figure 2: Illustration of the proposed XMusic, which supports flexible (a) X-Prompts to guide the generation of high-quality symbolic music. The XProjector analyzes these prompts, mapping them to symbolic music elements within the (b) Projection Space. Subsequently, the (c) Generator of XComposer transforms these symbolic music elements into token sequences based on our enhanced representation. It employs a Transformer Decoder as the generative model to predict successive events iteratively, thereby creating complete musical compositions. Finally, the (d) Selector of XComposer utilizes a Transformer Encoder to encode the complete token sequences and employs a multi-task learning scheme to evaluate the quality of the generated music.
  • Figure 3: Comparison between our representation and the Compound Word (CP) representation (Hsiao et al., 2021). The dotted boxes represent our new tokens in comparison with those of the CP representation.
  • Figure 4: Data statistics of our XMIDI dataset.
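
The data flow described in Figures 1 and 2 (prompt, projection space, Generator, Selector) can be sketched as a small generate-then-select loop. This is a toy illustration under stated assumptions, not the authors' implementation: `MusicElements`, `project_tag`, `generate`, `toy_quality`, and `compose` are hypothetical names, the emotion and genre vocabularies are placeholders rather than the XMIDI label set, the random pitch sampler stands in for the Transformer Decoder, and the distinct-pitch score stands in for the learned multi-task Selector.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Placeholder vocabularies (assumption: NOT the actual XMIDI label sets).
EMOTIONS = {"happy", "sad", "calm"}
GENRES = {"rock", "jazz", "pop"}


@dataclass
class MusicElements:
    """Hypothetical container for projection-space control signals."""
    emotion: Optional[str] = None
    genre: Optional[str] = None
    rhythm: Optional[str] = None
    notes: List[int] = field(default_factory=list)  # e.g. pitches from humming


def project_tag(tag: str) -> MusicElements:
    """Toy stand-in for XProjector: map a tag prompt to symbolic elements."""
    tag = tag.lower()
    if tag in EMOTIONS:
        return MusicElements(emotion=tag)
    if tag in GENRES:
        return MusicElements(genre=tag)
    return MusicElements()  # unknown tag: no control signal


def generate(elements: MusicElements, n_tokens: int, rng: random.Random) -> List[str]:
    """Toy stand-in for the Generator: control tokens, then sampled events."""
    tokens: List[str] = []
    if elements.emotion is not None:
        tokens.append(f"Emotion_{elements.emotion}")
    if elements.genre is not None:
        tokens.append(f"Genre_{elements.genre}")
    # A real model would predict each event from the preceding tokens;
    # here pitches are sampled uniformly just to produce a sequence.
    tokens.extend(f"Pitch_{rng.randint(60, 72)}" for _ in range(n_tokens))
    return tokens


def toy_quality(tokens: List[str]) -> float:
    """Placeholder score (fraction of distinct pitches), not a quality model."""
    pitches = [t for t in tokens if t.startswith("Pitch_")]
    return len(set(pitches)) / max(len(pitches), 1)


def compose(tag: str, n_candidates: int = 4, n_tokens: int = 16,
            seed: int = 0,
            score: Callable[[List[str]], float] = toy_quality) -> List[str]:
    """Generate several candidates and keep the highest-scoring one."""
    rng = random.Random(seed)
    elements = project_tag(tag)
    candidates = [generate(elements, n_tokens, rng) for _ in range(n_candidates)]
    return max(candidates, key=score)


if __name__ == "__main__":
    piece = compose("happy")
    print(piece[:5])
```

The point of the sketch is the separation of concerns: prompt parsing only produces control signals, so supporting a new modality would mean adding another `project_*` function while `generate` and the selection step stay unchanged.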