Yin-Yang: Developing Motifs With Long-Term Structure And Controllability

Keshav Bhandari, Geraint A. Wiggins, Simon Colton

TL;DR

The paper addresses how to extend a short motif into long-form, structurally coherent melodies using autoregressive transformers. It introduces Yin-Yang, a three-model framework consisting of a phrase generator, a phrase refiner, and a phrase selector, with a corruption-refinement training regime and a generation ratio $G:R$ guiding development. Evaluation on three monophonic folk datasets uses a Structure Derivation (SD) metric alongside the Vendi score to quantify motif derivation and output diversity, showing the approach outperforms Music Transformer and Compound Word Transformer baselines. The framework yields controllable, semi-interpretable long-range music generation and sets the stage for extensions to polyphonic data and richer conditioning.

Abstract

Transformer models have made great strides in generating symbolically represented music with local coherence. However, controlling the development of motifs in a structured way with global form remains an open research area. One reason for this challenge is the note-by-note autoregressive generation of such models, which lack the ability to correct themselves after deviating from the motif. In addition, their structural performance on datasets with shorter durations has not been studied in the literature. In this study, we propose Yin-Yang, a framework consisting of phrase generator, phrase refiner, and phrase selector models for developing motifs into melodies with long-term structure and controllability. The phrase refiner is trained with a novel corruption-refinement strategy that allows it to produce melodic and rhythmic variations of an original motif at generation time, thereby rectifying deviations of the phrase generator. We also introduce a new objective evaluation metric for quantifying how smoothly the motif manifests itself within the piece. Evaluation results show that our model performs better than state-of-the-art transformer models while being controllable and making the generated musical structure semi-interpretable, paving the way for musical analysis. Our code and demo page can be found at https://github.com/keshavbhandari/yinyang.

Paper Structure

This paper contains 16 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The phrase refiner encoder takes in phrase 1 along with the conditional tokens and the corrupted phrase 2 to generate the clean phrase 2 with its decoder. The conditional tokens include the type of corruption, key and time signatures, along with phrase length and cadence of the phrase.
  • Figure 2: Semi-interpretable examples of refined transformations of a given 2-bar motif. Fragmentation retains the red fragment while the green notes vary. For retrograde, an incorrect inversion corruption produces a retrograded motivic variation. In inversion, the generated pitch contour relates inversely to the motif in a non-strict way while adhering to prior context, as seen with the yellow arrow.
  • Figure 3: Semi-interpretable generation framework of Yin-Yang with 2 sections AB consisting of 3 and 4 phrases respectively.
  • Figure 4: Boxplot of overall MOS ratings by model group.
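
The captions for Figures 1 and 2 describe how a training example for the phrase refiner is put together: phrase 2 is corrupted, and the encoder receives phrase 1, the conditional tokens, and the corrupted phrase 2, while the decoder is asked to produce the clean phrase 2. The sketch below illustrates that layout only. The note representation (MIDI pitch and duration in beats), the token strings, the function names, and the textbook definitions of retrograde, inversion, and fragmentation are all assumptions made for illustration; the authors' repository may define the corruptions and the tokenization differently.

```python
from typing import Callable, Dict, List, Tuple

# Assumed note representation: (MIDI pitch, duration in beats).
Note = Tuple[int, float]


def retrograde(phrase: List[Note]) -> List[Note]:
    """Reverse the order of the notes (textbook retrograde)."""
    return list(reversed(phrase))


def inversion(phrase: List[Note]) -> List[Note]:
    """Mirror every pitch around the first note's pitch; durations are kept."""
    if not phrase:
        return []
    axis = phrase[0][0]
    return [(2 * axis - pitch, dur) for pitch, dur in phrase]


def fragmentation(phrase: List[Note]) -> List[Note]:
    """Keep only the first half of the phrase (an assumed simplification)."""
    return phrase[: max(1, len(phrase) // 2)]


# Hypothetical registry mapping a corruption name to its function.
CORRUPTIONS: Dict[str, Callable[[List[Note]], List[Note]]] = {
    "retrograde": retrograde,
    "inversion": inversion,
    "fragmentation": fragmentation,
}


def note_tokens(phrase: List[Note]) -> List[str]:
    """Hypothetical tokenization: one string token per note."""
    return [f"note_{pitch}_{dur}" for pitch, dur in phrase]


def build_refiner_example(
    phrase_1: List[Note],
    phrase_2: List[Note],
    corruption: str,
    key: str,
    time_signature: str,
    cadence: str,
) -> Tuple[List[str], List[str]]:
    """Return (encoder_input, decoder_target) following the Figure 1 layout."""
    corrupted = CORRUPTIONS[corruption](phrase_2)
    # Conditional tokens listed in the Figure 1 caption. Measuring phrase
    # length as a note count is an assumption.
    condition = [
        f"corruption={corruption}",
        f"key={key}",
        f"time_sig={time_signature}",
        f"length={len(phrase_2)}",
        f"cadence={cadence}",
    ]
    encoder_input = note_tokens(phrase_1) + condition + note_tokens(corrupted)
    decoder_target = note_tokens(phrase_2)  # the clean phrase 2
    return encoder_input, decoder_target


if __name__ == "__main__":
    motif = [(60, 1.0), (62, 0.5), (64, 0.5), (67, 2.0)]
    answer = [(67, 1.0), (65, 0.5), (64, 0.5), (60, 2.0)]
    enc, dec = build_refiner_example(motif, answer, "retrograde", "C", "4/4", "yes")
    print(enc)
    print(dec)
```

Keeping the corruption name among the conditional tokens mirrors the caption's statement that the type of corruption is part of the conditioning, which is what would let a user request a particular kind of variation at generation time.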