Improved symbolic drum style classification with grammar-based hierarchical representations

Léo Géré, Philippe Rigaux, Nicolas Audebert

TL;DR

This work tackles the problem of representing symbolic MIDI data for deep learning tasks, specifically drumming style classification, by moving beyond common tokenization and piano-roll encodings. It introduces a Linearized Rhythmic Tree (LRT) derived from a context-free musical grammar and augments it with a tree-based positional encoding (TBPE) to preserve hierarchical rhythm information in Transformer models. Empirical results on GrooveMIDI show that LRT with TBPE yields competitive or superior performance at roughly an order of magnitude fewer parameters than comparable LSTM baselines, and it demonstrates improved data efficiency relative to token-based or piano-roll representations. The findings suggest grammar-informed symbolic representations can enable more compact, rhythm-aware models with strong generalization for music style classification and potentially other symbolic-music tasks.

Abstract

Deep learning models have become a critical tool for the analysis and classification of musical data. These models operate either on the audio signal, e.g. a waveform or spectrogram, or on a symbolic representation, such as MIDI. In the latter case, musical information is often reduced to basic features, i.e. durations, pitches and velocities. Most existing works then rely on generic tokenization strategies from classical natural language processing, or on matrix representations, e.g. piano roll. In this work, we evaluate how enriched representations of symbolic data can impact deep models, namely Transformers and RNNs, for music style classification. In particular, we examine representations that make explicit the musical information only implicitly present in MIDI-like encodings, such as rhythmic organization, and show that they outperform generic tokenization strategies. We introduce a new tree-based representation of MIDI data built upon a context-free musical grammar. We show that this grammar representation accurately encodes high-level rhythmic information and outperforms existing encodings on the GrooveMIDI Dataset for drumming style classification, while being more compact and parameter-efficient.

Paper Structure (22 sections, 2 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Different representations of the same two bars of drums. The score is present for reference only.
  • Figure 2: Example of a tree built by qparse after rule simplification and re-rooting of measures (right), with its associated linearization and vector representation (left). In the matrix, the part above the dashed line contains the one-hot encoded rules (blue/yellow for 0/1), and the part below contains the instruments played at terminal nodes (color representing velocity).
  • Figure 3: Example of tree-based positional encoding (TBPE) for a tree of maximum depth $d_\text{max}=4$.
  • Figure 4: F1 scores on the validation set vs. number of parameters for a selected set of models. We observe that Transformers trained on LRT consistently outperform other models at similar capacity.
  • Figure 5: F1 scores on the validation set vs. percentage of training samples used. Transformers trained on LRT exhibit a less severe performance drop when the number of training samples decreases compared to existing models.
  • ...and 1 more figure
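
The matrix layout described in the Figure 2 caption (one-hot rules on top, instrument velocities underneath for terminal nodes) can be illustrated with a small sketch. This is not the authors' implementation: the rule names, the instrument list, the `Node` class and the choice of a depth-first pre-order traversal are all assumptions made for illustration, and the actual grammar and qparse output are not reproduced here.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical vocabularies (assumptions); the paper's grammar rules are not listed here.
RULES = ["measure", "split2", "split3", "note", "rest"]
INSTRUMENTS = ["kick", "snare", "hihat"]


@dataclass
class Node:
    rule: str                                              # grammar rule applied at this node
    children: list[Node] = field(default_factory=list)     # empty for terminal nodes
    hits: dict[str, float] = field(default_factory=dict)   # instrument -> velocity, terminals only


def linearize(root: Node) -> list[Node]:
    """Return the nodes in depth-first pre-order (parent before its children)."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children reversed so that the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return order


def encode(root: Node) -> list[list[float]]:
    """One vector per node: one-hot rule entries followed by per-instrument velocities."""
    vectors = []
    for node in linearize(root):
        one_hot = [1.0 if node.rule == r else 0.0 for r in RULES]
        velocities = [node.hits.get(name, 0.0) for name in INSTRUMENTS]
        vectors.append(one_hot + velocities)
    return vectors


# One measure split in two; the second half is split again into two notes.
tree = Node("measure", [
    Node("split2", [
        Node("note", hits={"kick": 0.9, "hihat": 0.5}),
        Node("split2", [
            Node("note", hits={"hihat": 0.4}),
            Node("note", hits={"snare": 0.8, "hihat": 0.5}),
        ]),
    ]),
])

sequence = encode(tree)
rule_order = [node.rule for node in linearize(tree)]
print(rule_order)
```

Each vector in `sequence` corresponds to one column of a matrix like the one in Figure 2. Internal nodes carry only a rule, so their velocity entries stay at zero, which mirrors the caption's statement that instruments appear for terminal nodes only.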