Small Tunes Transformer: Exploring Macro & Micro-Level Hierarchies for Skeleton-Conditioned Melody Generation

Yishan Lv, Jing Luo, Boyuan Ju, Xinyu Yang

TL;DR

This study examines the multi-level structure of music through macro-level and micro-level hierarchies, and proposes a novel Phrase-level Cross-Attention mechanism to capture the intrinsic relationship between the two.

Abstract

Symbolic music generation has recently become a focus of deep learning research. Structure is an important part of music and contributes to its quality, and a growing number of works study hierarchical structure. In this study, we examine the multi-level structure of music through macro-level and micro-level hierarchies. At the macro level, we apply a phrase segmentation algorithm to explore how phrases influence the overall development of a piece; at the micro level, we design a skeleton note extraction strategy to explore how the skeleton notes within each phrase guide melody generation. Furthermore, we propose a novel Phrase-level Cross-Attention mechanism to capture the intrinsic relationship between the macro-level and micro-level hierarchies. In response to the current lack of research on Chinese-style music, we also construct the Small Tunes Dataset: a substantial collection of MIDI files comprising 10088 Small Tunes, a category of traditional Chinese folk songs. This dataset serves as the focus of our study. We generate Small Tunes songs conditioned on the extracted skeleton notes, and experimental results indicate that our proposed model, Small Tunes Transformer, outperforms other state-of-the-art models. In addition, we design three novel objective evaluation metrics that evaluate music along both rhythm and melody dimensions.

Paper Structure

This paper contains 17 sections, 11 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Two multi-level hierarchies: phrase & bar-level hierarchies (left) and our macro & micro-level hierarchies (right). The dashed boxes indicate levels that are not considered in the respective hierarchies.
  • Figure 2: (a) An example of music representation: For instance, the first note will be represented as ($77$, $240$, $1$). (b) A piece of Molihua, a famous Chinese Folk Song. The blue-colored $A4$ note, followed by a passing note and returning to $A4$, will be selected as a Small Tunes Trembling Note.
  • Figure 3: An example of skeleton extraction. The skeleton notes consist of Small Tunes Trembling Note, Metrical Accent, Syncopation and Long Note.
  • Figure 4: Phrase-level Mask Matrix (left) and attention weights (right)
  • Figure 5: Architecture of Small Tunes Transformer
  • ...and 1 more figures
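
The Figure 2 caption describes a Small Tunes Trembling Note as a note followed by a passing note that then returns to the original pitch. A minimal sketch of that pattern check is given below. It is an illustration under stated assumptions, not the paper's extraction procedure: the function name `find_trembling_notes`, the `max_step` threshold, and the reading of the first tuple field as a MIDI pitch are all hypothetical, since the captions do not define the three fields of the note representation.

```python
from typing import List, Tuple

# Hypothetical layout: only the first field is used, and it is ASSUMED to be
# a MIDI pitch number. The meaning of the other two fields is not given here.
Note = Tuple[int, int, int]


def find_trembling_notes(notes: List[Note], max_step: int = 2) -> List[int]:
    """Return start indices of pitch patterns X, Y, X.

    Y must differ from X by at most `max_step` semitones. This threshold is
    an assumption standing in for "passing note"; it is not from the paper.
    """
    hits = []
    for i in range(len(notes) - 2):
        first = notes[i][0]
        middle = notes[i + 1][0]
        last = notes[i + 2][0]
        # The pattern returns to the starting pitch via a nearby, different pitch.
        if first == last and 0 < abs(middle - first) <= max_step:
            hits.append(i)
    return hits


# A4 is MIDI pitch 69; the sequence A4, B4, A4, E4 contains one such pattern.
example = [(69, 240, 1), (71, 240, 1), (69, 240, 1), (64, 480, 1)]
print(find_trembling_notes(example))  # prints [0]
```

Whether the paper marks the first or the last note of the pattern, and how wide a passing interval it allows, is not stated in the captions, so both choices here are placeholders.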