Moonbeam: A MIDI Foundation Model Using Both Absolute and Relative Music Attributes

Zixun Guo, Simon Dixon

TL;DR

Moonbeam addresses scalable symbolic-music modeling by introducing a MIDI-focused foundation model with a novel FME tokenizer and Multidimensional Relative Attention (MRA) to encode both absolute and relative attributes in a 5-D musical space (onset, duration, octave, pitch class, velocity) plus instrument. Pretrained on 81.6K hours of MIDI data (~18 billion tokens), it supports two finetuning pathways for symbolic understanding and conditional generation, and it outperforms prior large-scale symbolic music models on multiple downstream tasks while enabling reliable music infilling with full anticipatory capability. A unified finetuning framework integrates non-temporal conditions and chord information, with LoRA-based efficiency and a GRU decoder to capture inter-attribute dependencies. The work provides open-source code, pretrained models, and generated samples, and suggests broad applicability to MIR and other high-dimensional domains where relative positional information is crucial.
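
To make the 5-D attribute space concrete, the sketch below splits a MIDI note into onset, duration, octave, pitch class, and velocity, plus instrument. It is only an illustration: the names `NoteAttributes`, `encode_note`, and `relative_attributes` are hypothetical, the octave convention (`pitch // 12`) and tick-based time units are assumptions, and the per-attribute difference merely hints at what "relative" information means. It is not the paper's tokenizer or its MRA mechanism.

```python
from typing import NamedTuple, Tuple


class NoteAttributes(NamedTuple):
    """Hypothetical container for one note's absolute attributes."""
    onset: int        # assumed unit: ticks
    duration: int     # assumed unit: ticks
    octave: int       # assumed convention: midi_pitch // 12
    pitch_class: int  # midi_pitch % 12, where 0 is C
    velocity: int
    instrument: int   # kept alongside the 5-D space


def encode_note(onset: int, duration: int, midi_pitch: int,
                velocity: int, instrument: int) -> NoteAttributes:
    """Split a MIDI pitch (0-127) into octave and pitch class."""
    if not 0 <= midi_pitch <= 127:
        raise ValueError("MIDI pitch must be in [0, 127]")
    octave, pitch_class = divmod(midi_pitch, 12)
    return NoteAttributes(onset, duration, octave, pitch_class,
                          velocity, instrument)


def relative_attributes(a: NoteAttributes,
                        b: NoteAttributes) -> Tuple[int, ...]:
    """Per-attribute difference b - a over the five musical dimensions.

    Illustrative only: shows the kind of relative information
    (time gap, interval) that a relative scheme can expose.
    """
    return tuple(y - x for x, y in zip(a[:5], b[:5]))


if __name__ == "__main__":
    c4 = encode_note(onset=0, duration=480, midi_pitch=60,
                     velocity=80, instrument=0)
    g4 = encode_note(onset=480, duration=480, midi_pitch=67,
                     velocity=72, instrument=0)
    print(c4)
    print(relative_attributes(c4, g4))
```

Splitting pitch into octave and pitch class means that notes an octave apart share a pitch-class value, which is one plausible reason for factoring pitch this way.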

Abstract

Moonbeam is a transformer-based foundation model for symbolic music, pretrained on a large and diverse collection of MIDI data totaling 81.6K hours of music and 18 billion tokens. Moonbeam incorporates music-domain inductive biases by capturing both absolute and relative musical attributes through the introduction of a novel domain-knowledge-inspired tokenization method and Multidimensional Relative Attention (MRA), which captures relative music information without additional trainable parameters. Leveraging the pretrained Moonbeam, we propose 2 finetuning architectures with full anticipatory capabilities, targeting 2 categories of downstream tasks: symbolic music understanding and conditional music generation (including music infilling). Our model outperforms other large-scale pretrained music models in most cases in terms of accuracy and F1 score across 3 downstream music classification tasks on 4 datasets. Moreover, our finetuned conditional music generation model outperforms a strong transformer baseline with a REMI-like tokenizer. We open-source the code, pretrained model, and generated samples on GitHub.

Paper Structure

This paper contains 29 sections, 11 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Moonbeam Model Architecture.
  • Figure 2: Moonbeam Finetuning Architecture.
  • Figure 3: Distribution of the GPM30 dataset.
  • Figure 4: Two questions designed to filter for qualified participants during the listening test.
  • Figure 5: Listening test interface.