
Tempo estimation as fully self-supervised binary classification

Florian Henkel, Jaehun Kim, Matthew C. McCallum, Samuel E. Sandberg, Matthew E. P. Davies

TL;DR

This work tackles global tempo estimation without labeled data by reframing the task as binary same/different tempo classification on top of fixed, self-supervised music embeddings (MULE). Training relies on unlabeled data with time-stretch augmentations, so no real-world tempo annotations are needed, while prediction matches an unlabeled track against a set of synthetic tempo references. The model achieves competitive Acc2 scores (accuracy that tolerates octave errors) and benefits from a post-hoc tempo-octave correction that improves Acc1 (accuracy at the exact tempo octave), demonstrating that tempo information is embedded in generic audio representations. The approach highlights the potential of fully self-supervised tempo estimation when annotated data is scarce.

Abstract

This paper addresses the problem of global tempo estimation in musical audio. Given that annotating tempo is time-consuming and requires a certain level of musical expertise, few publicly available data sources exist to train machine learning models for this task. Towards alleviating this issue, we propose a fully self-supervised approach that does not rely on any human-labeled data. Our method builds on the fact that generic (music) audio embeddings already encode a variety of properties, including information about tempo, making them easily adaptable for downstream tasks. While recent work in self-supervised tempo estimation aimed to learn a tempo-specific representation that was subsequently used to train a supervised classifier, we reformulate the task as the binary classification problem of predicting whether a target track has the same or a different tempo compared to a reference. Whereas the former approach still requires labeled training data for the final classification model, ours uses arbitrary unlabeled music data in combination with time-stretching for model training, as well as a small set of synthetically created reference samples for predicting the final tempo. Evaluation of our approach in comparison with the state of the art reveals highly competitive performance when the constraint of finding the precise tempo octave is relaxed.
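
The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stretch factors, the 30 to 300 BPM reference grid, and the function names `sample_training_pair`, `predict_tempo`, and `toy_same_tempo_prob` are all assumptions made for illustration, and the toy scoring function merely stands in for a trained same/different classifier operating on fixed embeddings.

```python
import math
import random
from typing import Callable, Dict, Sequence, Tuple


def sample_training_pair(
    rng: random.Random,
    factors: Sequence[float] = (0.8, 0.9, 1.0, 1.1, 1.2),  # assumed stretch factors
    p_same: float = 0.5,                                    # assumed class balance
) -> Tuple[float, float, int]:
    """Pick two stretch factors for one excerpt.

    The label is 1 (same tempo) when both copies are stretched by the
    same factor and 0 (different tempo) otherwise. No tempo annotation
    is needed to produce this label.
    """
    first = rng.choice(list(factors))
    if rng.random() < p_same:
        second = first
    else:
        second = rng.choice([f for f in factors if f != first])
    return first, second, int(first == second)


def predict_tempo(
    track: object,
    references: Dict[float, object],
    same_tempo_prob: Callable[[object, object], float],
) -> float:
    """Return the BPM of the reference with the highest same-tempo probability."""
    return max(references, key=lambda bpm: same_tempo_prob(track, references[bpm]))


def toy_same_tempo_prob(a: float, b: float) -> float:
    """Hypothetical stand-in for the trained classifier: here an 'embedding'
    is just a number, and the score decays with the distance between them."""
    return math.exp(-abs(a - b) / 5.0)


# Synthetic references, one per integer BPM (the range is an assumption).
references = {float(bpm): float(bpm) for bpm in range(30, 301)}
predicted = predict_tempo(127.6, references, toy_same_tempo_prob)
print(predicted)
```

In a real setup, `track` and the reference values would be embeddings from the fixed network, and `same_tempo_prob` would be the trained binary classifier; only that classifier is learned.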

Paper Structure (13 sections, 3 figures, 1 table)

Figures (3)

  • Figure 1: Overview of the task setup. For training (a), we sample random mel-spectrogram excerpts, and the task is to predict whether they still have the same tempo after being time-stretched (with potentially different stretch factors) and cropped to $3$ seconds of audio. For prediction (b), we compare an unlabeled track against all reference tracks and look for the reference with the highest probability of having the same tempo; the tempo of that reference is reported as the prediction. Note that the embedding network is fixed, i.e., not updated during training, and is the same in all stages.
  • Figure 2: Visualization of predicted vs. ground truth tempo for GTZAN, ACM-Mirum and Giantsteps using SDNet$_{\text{argmax}}$. As indicated by the dashed lines, the model makes multiple octave errors.
  • Figure 3: Typical example of an octave error in the GTZAN dataset. We observe high probabilities for the annotated tempo as well as for half and double that tempo.
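
The octave errors shown in Figures 2 and 3 are what separates the Acc1 and Acc2 metrics. The sketch below assumes the conventional definitions from the tempo estimation literature (a 4% tolerance, with Acc2 additionally accepting estimates at factors 2, 3, 1/2 and 1/3 of the annotated tempo); this page does not state the exact settings used in the paper, and the function names are illustrative.

```python
from typing import Callable, Sequence


def acc1(estimate: float, truth: float, tol: float = 0.04) -> bool:
    """True if the estimate lies within +/- tol of the annotated tempo."""
    return abs(estimate - truth) <= tol * truth


def acc2(
    estimate: float,
    truth: float,
    tol: float = 0.04,
    factors: Sequence[float] = (1.0, 2.0, 3.0, 1.0 / 2.0, 1.0 / 3.0),
) -> bool:
    """Like acc1, but octave errors (double, triple, half, third) also count."""
    return any(acc1(estimate, truth * f, tol) for f in factors)


def accuracy(
    estimates: Sequence[float],
    truths: Sequence[float],
    metric: Callable[[float, float], bool],
) -> float:
    """Fraction of tracks for which the metric accepts the estimate."""
    hits = sum(metric(e, t) for e, t in zip(estimates, truths))
    return hits / len(truths)


# Three tracks annotated at 120 BPM: one correct, one half-tempo, one wrong.
estimates = [120.0, 60.0, 90.0]
truths = [120.0, 120.0, 120.0]
print(accuracy(estimates, truths, acc1))  # only the first estimate counts
print(accuracy(estimates, truths, acc2))  # the half-tempo estimate counts too
```

A half-tempo prediction such as 60 BPM for a 120 BPM track fails Acc1 but passes Acc2, which is why a model with many octave errors can still score well once the octave constraint is relaxed.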