Beat this! Accurate beat tracking without DBN postprocessing

Francesco Foscarin, Jan Schlüter, Gerhard Widmer

TL;DR

This work addresses beat and downbeat tracking, aiming for generality across diverse music without relying on Dynamic Bayesian Network (DBN) postprocessing. It introduces a model with roughly 20 M parameters that combines a frontend, partial transformers over frequency or time, rotary positional embeddings, and a shift-tolerant loss, and it achieves state-of-the-art F1 on 18 datasets without a DBN. Ablations show that the shift-tolerant loss, the partial transformers, and data augmentation are key to performance. Continuity metrics suffer on complex pieces, however, which suggests a trade-off between local accuracy and global periodicity. The authors release open-source code, pretrained models, and datasets to invite community improvement, and outline future directions: model compression, loss functions that enforce periodicity, and better dataset quality.

Abstract

We propose a system for tracking beats and downbeats with two objectives: generality across a diverse music range, and high accuracy. We achieve generality by training on multiple datasets -- including solo instrument recordings, pieces with time signature changes, and classical music with high tempo variations -- and by removing the commonly used Dynamic Bayesian Network (DBN) postprocessing, which introduces constraints on the meter and tempo. For high accuracy, among other improvements, we develop a loss function tolerant to small time shifts of annotations, and an architecture alternating convolutions with transformers either over frequency or time. Our system surpasses the current state of the art in F1 score despite using no DBN. However, it can still fail, especially for difficult and underrepresented genres, and performs worse on continuity metrics, so we publish our model, code, and preprocessed datasets, and invite others to beat this.
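
For readers unfamiliar with the F1 score used in beat tracking, the following is a minimal sketch of how it is commonly computed: an estimated beat counts as a hit if it lies within a tolerance window of a not-yet-matched annotated beat. The ±70 ms window is the conventional default in beat evaluation and is an assumption here, since the abstract does not state the evaluation settings. The function name `beat_f1` is illustrative and not taken from the authors' code.

```python
def beat_f1(reference, estimated, tolerance=0.07):
    """F1 between annotated and estimated beat times (in seconds).

    Assumption: a +/- 70 ms tolerance window, the usual convention in
    beat tracking evaluation; the abstract does not specify it.
    Each annotation can be matched by at most one estimate.
    """
    ref = sorted(reference)
    est = sorted(estimated)
    i = j = hits = 0
    # Two-pointer sweep over both sorted lists.
    while i < len(ref) and j < len(est):
        diff = est[j] - ref[i]
        if abs(diff) <= tolerance:
            hits += 1      # estimate j matches annotation i
            i += 1
            j += 1
        elif diff < 0:
            j += 1         # estimate is too early: false positive
        else:
            i += 1         # annotation was missed: false negative
    if hits == 0:
        return 0.0
    precision = hits / len(est)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


# Three of four estimates fall within 70 ms of an annotation.
score = beat_f1([0.5, 1.0, 1.5, 2.0], [0.52, 1.0, 1.7, 2.05])
```

Because F1 scores each beat on its own, it does not reward long uninterrupted runs of correct beats, which is why a system can lead on F1 and still trail on continuity metrics.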

Paper Structure (18 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Full model architecture.
  • Figure 2: The standard binary cross-entropy loss (left plot) encourages high network outputs (upward arrow) at beat annotations (vertical line), and low outputs for all other frames (downward arrows). Max-pooling the predictions over time redistributes gradients to local maxima (right plot). This way, slightly shifted annotations do not affect learning, and the network produces confident sharp peaks.
  • Figure :
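
The idea in the Figure 2 caption can be sketched in a few lines. This is a simplified, dependency-free illustration and not the authors' implementation: predictions are max-pooled over time before the binary cross-entropy is computed, so a confident peak a few frames away from an annotation still counts as correct. The function names, the pooling radius of 3 frames, and the choice to ignore frames near an annotation (so that a slightly shifted peak is not punished as a false positive) are all assumptions of this sketch.

```python
import math


def max_pool_1d(values, radius):
    """Sliding maximum over 2 * radius + 1 frames (window truncated at edges)."""
    n = len(values)
    return [max(values[max(0, i - radius):min(n, i + radius + 1)])
            for i in range(n)]


def shift_tolerant_bce(probs, targets, radius=3, eps=1e-7):
    """Mean binary cross-entropy on max-pooled predictions.

    probs:   per-frame beat probabilities in [0, 1]
    targets: per-frame 0/1 beat annotations
    radius:  tolerated shift in frames (assumed value, not from the source)

    Assumption of this sketch: non-beat frames within 2 * radius of an
    annotation are ignored, because their pooled window could still
    contain a peak that is shifted by up to `radius` frames.
    With radius=0 this reduces to plain binary cross-entropy.
    """
    pooled = max_pool_1d(probs, radius)
    near_beat = max_pool_1d(targets, 2 * radius)
    total, count = 0.0, 0
    for p, t, near in zip(pooled, targets, near_beat):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        if t == 1:
            total += -math.log(p)         # reward a peak anywhere in the window
        elif near == 1:
            continue                      # too close to an annotation: ignored
        else:
            total += -math.log(1.0 - p)   # push local maxima down elsewhere
        count += 1
    return total / count


# One annotated beat at frame 10; the network peaks one frame late.
targets = [0] * 20
targets[10] = 1
probs = [0.01] * 20
probs[11] = 0.99

plain = shift_tolerant_bce(probs, targets, radius=0)     # penalises the shift twice
tolerant = shift_tolerant_bce(probs, targets, radius=3)  # treats the peak as correct
```

In this toy case the plain loss penalises both the missed annotation frame and the late peak, while the tolerant variant gives nearly the same loss as a perfectly aligned peak.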