MeloTrans: A Text to Symbolic Music Generation Model Following Human Composition Habit

Yutian Wang, Wanyin Yang, Zhenrong Dai, Yilong Zhang, Kun Zhao, Hui Wang

TL;DR

This paper develops POP909_M, the first dataset to include labels for musical motifs and their variants, providing a basis for mimicking human compositional habits. It also proposes MeloTrans, a text-to-music composition model that applies motif development rules.

Abstract

At present, neural network models show powerful sequence prediction ability and are used in many automatic composition models. By contrast, humans compose music very differently. Composers usually start by creating musical motifs and then develop them into music through a series of rules. This process ensures that the music has a specific structure and pattern of change. However, it is difficult for neural network models to learn these composition rules from training data, which results in a lack of musicality and diversity in the generated music. This paper posits that integrating the learning capabilities of neural networks with human-derived knowledge may lead to better results. To achieve this, we develop the POP909_M dataset, the first to include labels for musical motifs and their variants, providing a basis for mimicking human compositional habits. Building on this, we propose MeloTrans, a text-to-music composition model that applies motif development rules. Our experiments demonstrate that MeloTrans outperforms existing music generation models and even surpasses Large Language Models (LLMs) like ChatGPT-4. This highlights the importance of merging human insights with neural network capabilities to achieve superior symbolic music generation.

Paper Structure

This paper contains 20 sections, 8 equations, 9 figures, 5 tables, 2 algorithms.

Figures (9)

  • Figure 1: Comparison of (a) human composition process and (b) neural network model composition process.
  • Figure 2: Demo of data organization in POP909_M. In the motif and variants tracks, the motif and variants in the corresponding melody track are marked with horizontal lines.
  • Figure 3: Architecture of TTMM.
  • Figure 4: Architecture of MGM.
  • Figure 5: The architecture of the decoder in MGM.
  • ...and 4 more figures