Similar but Faster: Manipulation of Tempo in Music Audio Embeddings for Tempo Prediction and Search

Matthew C. McCallum; Florian Henkel; Jaehun Kim; Samuel E. Sandberg; Matthew E. P. Davies

TL;DR

The paper tackles the challenge of context-specific audio similarity by enabling tempo-focused manipulation within a fixed embedding space. It learns a self-supervised tempo translation function $f$ that, given an embedding $z$ and a tempo factor $\tau$, predicts the translated embedding corresponding to tempo $T' = \tau T$, leveraging time-stretched targets without retraining the backbone model. The approach enables efficient tempo-specific and tempo-contour nearest-neighbor retrieval, and provides a data augmentation path for tempo prediction that rivals Mel-spectrogram augmentation while avoiding audio reprocessing. Empirically, tempo-aware embedding manipulation improves both retrieval quality and tempo labeling performance, demonstrating practical benefits for scalable, tempo-aware music search and discovery. The work paves the way for extending embedding-space manipulation to additional musical attributes beyond tempo.
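The training setup summarized above can be written as a single objective. This is a sketch under assumptions, not the paper's stated loss: $g$ denotes the frozen backbone embedding model, $s_\tau$ a time-stretch of the audio excerpt $x$ by factor $\tau$, $f_\theta$ the translation function with learnable parameters $\theta$, and $d$ some distance between embeddings whose exact form is not specified here.

```latex
\mathcal{L}(\theta) = \mathbb{E}_{x,\,\tau}\Big[\, d\big(f_\theta(g(x), \tau),\; g(s_\tau(x))\big) \Big]
```

Under this reading only $\theta$ is updated, which is consistent with the backbone not being retrained: embeddings already computed for a catalogue stay valid.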

Abstract

Audio embeddings enable large-scale comparisons of the similarity of audio files for applications such as search and recommendation. Due to the subjectivity of audio similarity, it can be desirable to design systems that answer not only whether audio is similar, but similar in what way (e.g., with respect to tempo, mood, or genre). Previous works have proposed disentangled embedding spaces where subspaces representing specific, yet possibly correlated, attributes can be weighted to emphasize those attributes in downstream tasks. However, no research has been conducted into the independence of these subspaces, nor their manipulation, in order to retrieve tracks that are similar but different in a specific way. Here, we explore the manipulation of tempo in embedding spaces as a case study towards this goal. We propose tempo translation functions that allow for efficient manipulation of tempo within a pre-existing embedding space whilst maintaining other properties such as genre. As this translation is specific to tempo, it enables retrieval of tracks that are similar but have specifically different tempi. We show that such a function can be used as an efficient data augmentation strategy both for training downstream tempo predictors and for improved nearest neighbor retrieval of properties largely independent of tempo.
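To make the "similar but faster" retrieval idea concrete, here is a minimal sketch on toy data. Everything in it is an assumption made for illustration: `toy_translate` is a hand-written stand-in for the learned translation function, the toy embedding stores log2(BPM) in its last dimension, and Euclidean distance is used for the neighbor search. The learned function and real embedding space in the paper will differ.

```python
import numpy as np

def toy_translate(z, tau):
    """Hypothetical stand-in for the learned translation function f(z, tau).

    ASSUMPTION: in this toy space the last dimension holds log2(tempo in BPM),
    so scaling tempo by tau is a shift of log2(tau) along that dimension.
    """
    z = np.array(z, dtype=float, copy=True)
    z[..., -1] += np.log2(tau)
    return z

def nearest_neighbors(query, catalogue, k=5):
    """Return indices of the k catalogue rows closest to the query (Euclidean)."""
    dists = np.linalg.norm(catalogue - query, axis=1)
    return np.argsort(dists)[:k]

# Toy catalogue: two "style" dimensions plus log2(BPM).
tempi = np.array([60.0, 90.0, 120.0, 180.0, 120.0])
styles = np.array([[0, 0], [0, 0], [0, 0], [0, 0], [5, 5]], dtype=float)
catalogue = np.column_stack([styles, np.log2(tempi)])

query = catalogue[1]                     # 90 BPM, style (0, 0)
faster = toy_translate(query, tau=2.0)   # same style, target tempo 180 BPM
match = nearest_neighbors(faster, catalogue, k=1)[0]
```

The point of the sketch is that only the query embedding is transformed: the catalogue is searched as-is, so no audio needs to be time-stretched or re-embedded at query time.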

Paper Structure (9 sections, 6 equations, 2 figures, 2 tables)

Introduction
Methodology
Music Audio Embeddings
Learning a Tempo Translation Function
Experiments & Results
Nearest Neighbor Retrieval of Specific Tempo
Nearest Neighbor Retrieval Impartial to Tempo
Data Augmentation for Downstream Tempo Labelling
Conclusions

Figures (2)

Figure 1: Outline of the training setup. Given a mel-spectrogram excerpt, the task of the translation function is to predict the translated embedding of a time-stretched version of said excerpt.
Figure 2: Alignment between tempo and tags of the source embeddings and their $5$ nearest neighbors (NN) across different factors for embedding translation, audio translation (based on SoX time-stretching), and untranslated embeddings. Tempo alignment is reported on the GTZAN dataset and is measured by Accuracy 1 (Acc1; Gouyon et al., 2006) between the translated source tempo and nearest neighbor tempi. For tag alignment, we report tag precision of neighbors on the test partition of the MTT dataset.
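The tempo-alignment measure in Figure 2 can be illustrated with a short sketch. It assumes the conventional definition of Accuracy 1, where an estimate counts as correct if it lies within 4% of the reference tempo; the function name `accuracy1` and the BPM values are made up for the example and are not taken from the paper.

```python
import numpy as np

def accuracy1(reference_bpm, estimated_bpm, tolerance=0.04):
    """Fraction of estimates within +/- tolerance (relative) of the reference.

    ASSUMPTION: the 4% default follows the usual Accuracy 1 convention.
    """
    reference_bpm = np.asarray(reference_bpm, dtype=float)
    estimated_bpm = np.asarray(estimated_bpm, dtype=float)
    within = np.abs(estimated_bpm - reference_bpm) <= tolerance * reference_bpm
    return float(np.mean(within))

# A 100 BPM source translated by a factor of 1.2 should retrieve ~120 BPM tracks.
source_bpm = 100.0
tau = 1.2
target_bpm = source_bpm * tau

# Hypothetical tempi of the 5 nearest neighbors of the translated embedding.
neighbor_bpm = np.array([118.0, 121.0, 126.0, 60.0, 120.0])
score = accuracy1(np.full_like(neighbor_bpm, target_bpm), neighbor_bpm)
```

Here three of the five neighbors fall inside the 4% window around 120 BPM. Note that the half-tempo neighbor at 60 BPM counts as a miss under this metric.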
