Analyzing and reducing the synthetic-to-real transfer gap in Music Information Retrieval: the task of automatic drum transcription

Mickaël Zehren; Marco Alunno; Paolo Bientinesi

TL;DR

The paper tackles the synthetic-to-real transfer gap in automatic drum transcription by proposing three realism-enhancing strategies and a new synthetic dataset, ADTOS, built from human-performed MIDI loops with accompaniment and a large number of presets. It uses scaling-law analysis to quantify how performance scales with the amount of training data and to estimate the minimal generalization error ($\gamma$) achievable with each generation procedure. Results show that ADTOS is more realistic and has a smaller transfer gap than prior synthetic datasets, though training on real data still outperforms any synthetic-only training. Ablations show that human-performed MIDI, accompanying instruments, and large timbre diversity each help narrow the transfer gap, with diminishing returns beyond certain points.

Abstract

Automatic drum transcription is a critical tool in Music Information Retrieval for extracting and analyzing the rhythm of a music track, but it is limited by the size of the datasets available for training. A popular way to increase the amount of data is to generate it synthetically from music scores rendered with virtual instruments. This method can produce a virtually infinite quantity of tracks, but empirical evidence shows that models trained on previously created synthetic datasets do not transfer well to real tracks. In this work, besides increasing the amount of data, we identify and evaluate three more strategies that practitioners can use to improve the realism of the generated data and thus narrow the synthetic-to-real transfer gap. To explore their efficacy, we used them to build a new synthetic dataset and then measured how the performance of a model scales and, specifically, at what value it stagnates as the number of training tracks increases for different datasets. By doing this, we were able to show that these strategies help make our dataset the one with the most realistic data distribution and the lowest synthetic-to-real transfer gap among the synthetic datasets we evaluated. We conclude by highlighting the limits of training with infinite data in drum transcription and showing how they can be overcome.
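
The idea of measuring where the loss stagnates as training tracks are added can be illustrated with a small curve fit. The sketch below is not the authors' code, and the paper's scaling equation is not reproduced on this page. It assumes a common power-law form, loss(n) ≈ alpha · n^(−beta) + gamma, where gamma plays the role of the lower bound on the loss. The function name `fit_power_law`, the grid search over gamma, and the synthetic data points are all hypothetical choices made for illustration.

```python
import math


def fit_power_law(n_tracks, losses, grid=200):
    """Fit loss(n) ~ alpha * n**(-beta) + gamma (assumed form).

    Scans candidate values of gamma below the smallest observed loss and,
    for each one, runs ordinary least squares on log(loss - gamma) versus
    log(n). Returns the (alpha, beta, gamma) with the smallest residual.
    """
    xs = [math.log(n) for n in n_tracks]
    x_mean = sum(xs) / len(xs)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    lowest = min(losses)
    best = None
    for i in range(grid):
        gamma = lowest * i / grid  # candidates in [0, min(losses))
        ys = [math.log(loss - gamma) for loss in losses]
        y_mean = sum(ys) / len(ys)
        slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
        intercept = y_mean - slope * x_mean
        sse = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, math.exp(intercept), -slope, gamma)
    _, alpha, beta, gamma = best
    return alpha, beta, gamma


# Synthetic example data (not taken from the paper): an exact power law
# with alpha=2.0, beta=0.5 and a floor of gamma=0.1.
sizes = [10, 30, 100, 300, 1000, 3000]
observed = [2.0 * n ** -0.5 + 0.1 for n in sizes]
alpha, beta, gamma = fit_power_law(sizes, observed)
print(f"alpha={alpha:.3f} beta={beta:.3f} gamma={gamma:.3f}")
```

Under this assumed form, gamma is the loss a model would approach with unlimited tracks from a given generation procedure, so comparing the fitted gamma across datasets compares their best achievable transfer to real data.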

Paper Structure (16 sections, 1 equation, 4 figures, 1 table)

This paper contains 16 sections, 1 equation, 4 figures, 1 table.

Introduction
Related works
Semi-automatic annotations
Synthetic datasets
Training dataset
Generation procedure
Comparing data distributions
Experimental design
Independent variable
Dependent variables
Control variables
Results
Transfer gap for different generation procedures
Ablation study
Conclusions
...and 1 more section

Figures (4)

Figure 1: Violin plots and bar plots showing, respectively, the distributions of continuous variables (tempo, velocity, and onset interval) and discrete variables (time signature and class) for the synthetic datasets (left column) and real-world datasets (right column). The distributions are normalized so that each plot has the same width.
Figure 2: Validation and test loss as a function of the number of tracks when training on different datasets. The solid lines represent the learning curves, fitted in log space, from equation \ref{eq:scaling}. The dashed lines represent the value of $\gamma$, the lower bound of the loss. Note the log-log scale.
Figure 3: Learning curves for different versions of ADTOS by modifying: a) the MIDI source, b) the number of voices, and c) the number of presets.
Figure 4: Relative frequency at which beats (left) or unique beat sequences (right) from a target are included in the source. Numbers in parentheses represent the count of beats or sequences in the datasets.
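
The inclusion measure described in the Figure 4 caption can be sketched as a set-membership count. This is a hypothetical illustration, not the authors' procedure: how the paper encodes a beat or a beat sequence is not stated here, so the sketch assumes each item is any hashable value (for example, a tuple of drum-class labels). The function name `inclusion_rates` and the toy data are invented for the example.

```python
def inclusion_rates(source_items, target_items):
    """Return two inclusion rates of a target dataset in a source dataset.

    The first value counts every occurrence in the target (repeats
    included); the second counts each distinct target item once.
    Items must be hashable, e.g. tuples of drum-class labels (assumed
    encoding, not taken from the paper).
    """
    source_set = set(source_items)
    hits = sum(1 for item in target_items if item in source_set)
    unique_target = set(target_items)
    unique_hits = len(unique_target & source_set)
    return hits / len(target_items), unique_hits / len(unique_target)


# Toy data: two source patterns, three target occurrences.
source = [("kick", "hihat"), ("snare", "hihat")]
target = [("kick", "hihat"), ("kick", "hihat"), ("tom",)]
per_occurrence, per_unique = inclusion_rates(source, target)
print(per_occurrence, per_unique)
```

Counting occurrences and counting distinct items answer different questions: the first reflects how much of the target material is covered, the second how much of its variety is covered.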
