Generating High-quality Symbolic Music Using Fine-grained Discriminators

Zhedong Zhang, Liang Li, Jiehua Zhang, Zhenghui Hu, Hongkui Wang, Chenggang Yan, Jian Yang, Yuankai Qi

TL;DR

This work addresses the challenge of generating high-quality symbolic music by moving beyond a single global discriminator to fine-grained discriminators that separately target melody and rhythm. A melody–rhythm decoupling module masks relevant tokens, enabling targeted feedback, while a pitch augmentation strategy and a bar-level relative positional encoding enhance discrimination for their respective domains. The generator is a seq2seq symbolic music transformer trained with a combined likelihood and adversarial objective, guided by both discriminators. Experiments on POP909 show improvements across objective metrics and subjective listening tests, with MIDI-BERT similarity reaching the closest alignment to human compositions, indicating stronger musicality and style. Overall, the method demonstrates that fine-grained, domain-specific adversarial feedback can significantly improve the quality and realism of generated symbolic music.
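The decoupling step mentioned above can be illustrated with a minimal sketch. The summary only says that the module masks relevant tokens, so everything concrete here is an assumption for illustration: the token names (`Pitch_*`, `Velocity_*`, `[Bar]`), the `[MASK]` placeholder, and the choice of which token family is hidden from which discriminator.

```python
MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def decouple(tokens):
    """Return (melody_view, rhythm_view) of one token sequence.

    Assumption: the melody view hides velocity tokens and the rhythm
    view hides pitch tokens. The paper's actual masking rule may differ.
    """
    melody_view = [MASK if t.startswith("Velocity_") else t for t in tokens]
    rhythm_view = [MASK if t.startswith("Pitch_") else t for t in tokens]
    return melody_view, rhythm_view


# Toy sequence with a hypothetical vocabulary.
tokens = ["[Bar]", "Pitch_60", "Velocity_80", "Pitch_64", "Velocity_72"]
melody_view, rhythm_view = decouple(tokens)
```

Each view keeps the sequence length unchanged, so in a setup like this both discriminators would see inputs aligned position by position with the generator output.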

Abstract

Existing symbolic music generation methods usually use a discriminator to improve the quality of generated music via global perception of the music. However, given the complexity of the information in music, such as rhythm and melody, a single discriminator cannot fully reflect the differences in these two primary dimensions. In this work, we propose to decouple melody and rhythm from music and to design corresponding fine-grained discriminators to address this issue. Specifically, equipped with a pitch augmentation strategy, the melody discriminator discerns the melody variations presented by the generated samples. By contrast, the rhythm discriminator, enhanced with bar-level relative positional encoding, focuses on the velocity of generated notes. Such a design makes the generator more explicitly aware of which aspects of the generated music should be adjusted, making it easier to mimic human-composed music. Experimental results on the POP909 benchmark demonstrate the favorable performance of the proposed method compared to several state-of-the-art methods in terms of both objective and subjective metrics.
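The abstract does not say how the pitch augmentation strategy works. As a rough illustration only, the sketch below assumes one common form of pitch augmentation, transposing every pitch token by a fixed number of semitones. The function name `transpose`, the `Pitch_*` token format, and the clamping to the MIDI range 0 to 127 are all assumptions, not details from the paper.

```python
def transpose(tokens, semitones, low=0, high=127):
    """Shift every Pitch_* token by `semitones`, leaving other tokens as-is.

    Assumption: pitches are MIDI note numbers encoded as "Pitch_<int>".
    Values are clamped to [low, high]; note that clamping can distort
    intervals at the extremes of the range.
    """
    out = []
    for t in tokens:
        if t.startswith("Pitch_"):
            pitch = int(t.split("_", 1)[1]) + semitones
            pitch = max(low, min(high, pitch))
            out.append(f"Pitch_{pitch}")
        else:
            out.append(t)
    return out


# Toy sequence with a hypothetical vocabulary, shifted up by two semitones.
shifted = transpose(["[Bar]", "Pitch_60", "Velocity_80", "Pitch_127"], 2)
```

Under this reading, the augmented sequences would give the melody discriminator additional melodic variants to compare against generated samples.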

Paper Structure (11 sections, 5 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: (a) Main structure of a conventional GAN-based method with a coarse-grained global discriminator. (b) The structure of the proposed fine-grained discriminator architecture.
  • Figure 2: Main framework of the proposed symbolic music generation model, which consists of three main components: a music generator and two fine-grained discriminators, namely a rhythm discriminator and a melody discriminator.
  • Figure 3: Illustration of the proposed bar-level relative positional encoding (RPE). The relative position accumulates from the previous [Bar] token to the next [Bar] token. It is implemented as a learnable embedding and added to the token embedding together with the vanilla positional embedding.
  • Figure 4: Quantitative analysis. (a) & (b) Visualization of the note pitch and note velocity distributions of music generated by different models and of the Ground Truth.
  • Figure 5: The PCA visualization results of music features obtained from MIDI-BERT (Chou et al., 2021).
  • ...and 1 more figures
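
The bar-level relative position described in the Figure 3 caption can be sketched as an index that restarts at every [Bar] token. This is a minimal sketch under stated assumptions: the function name `bar_relative_positions` is hypothetical, the [Bar] token itself is given index 0, and tokens before the first [Bar] are counted from the start of the sequence. In a model, these indices would select rows of a learnable embedding table whose output is added to the token and vanilla positional embeddings.

```python
def bar_relative_positions(tokens, bar_token="[Bar]"):
    """Return, for each token, its offset from the most recent bar token.

    Assumption: the bar token itself receives offset 0, and the counter
    increases by one per token until the next bar token resets it.
    """
    positions = []
    offset = 0
    for tok in tokens:
        if tok == bar_token:
            offset = 0  # restart counting at every bar boundary
        positions.append(offset)
        offset += 1
    return positions


# Toy sequence with a hypothetical vocabulary.
sequence = ["[Bar]", "Pitch_60", "Velocity_80", "[Bar]", "Pitch_62"]
rel_pos = bar_relative_positions(sequence)
```

Because the counter resets at each bar, tokens at the same place within different bars share an index, which is what would let a rhythm-focused model compare patterns across bars.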