On the Effect of Data-Augmentation on Local Embedding Properties in the Contrastive Learning of Music Audio Representations

Matthew C. McCallum, Matthew E. P. Davies, Florian Henkel, Jaehun Kim, Samuel E. Sandberg

TL;DR

The paper examines how intra-pair data augmentation shapes the local properties of contrastively learned music audio embeddings. Evaluating a MULE-based model with time-stretching, pitch-shifting, equalization, and combinations of these, it shows that augmentation can reduce the locality of tempo and key while increasing the locality of genre and instrumentation, improving nearest-neighbor and labeling tasks in a task-dependent manner. Key findings are that tempo is a dominant organizing factor of local neighborhoods, that time-stretching often yields state-of-the-art results in retrieval tasks, and that pitch-shifting can improve performance on tempo-related labeling tasks. These findings offer practical guidance for embedding design in music search and recommendation systems: the augmentation strategy should be matched to the downstream objective.

Abstract

Audio embeddings are crucial tools in understanding large catalogs of music. Typically, embeddings are evaluated on the basis of the performance they provide in a wide range of downstream tasks; however, few studies have investigated the local properties of the embedding spaces themselves, which are important in nearest neighbor algorithms commonly used in music search and recommendation. In this work, we show that when learning audio representations on music datasets via contrastive learning, musical properties that are typically homogeneous within a track (e.g., key and tempo) are reflected in the locality of neighborhoods in the resulting embedding space. By applying appropriate data augmentation strategies, the localisation of such properties can not only be reduced, but the localisation of other attributes can also be increased. For example, the locality of features such as pitch and tempo, which are less relevant to non-expert listeners, may be mitigated while improving the locality of more salient features such as genre and mood, achieving state-of-the-art performance in nearest neighbor retrieval accuracy. Similarly, we show that the optimal selection of data augmentation strategies for contrastive learning of music audio embeddings is dependent on the downstream task, highlighting this as an important embedding design decision.

Paper Structure

This paper contains 9 sections, 3 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Sampling and augmentation pipeline diagram.
  • Figure 2: Mean and interquartile range of cosine distance between embeddings of unmodified tracks and (a) time-stretched tracks, and (b) pitch-shifted tracks, for different augmentation pipelines.
  • Figure 3: Average metrics of neighborhoods of size $k$ for each fine-tuned model (a) Tempo RMMS distance (over AllTempo-test), (b) Key Precision (over GSKey), (c) Tag Precision (over MSD-test).
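
The quantities named in the captions of Figures 2 and 3 can be illustrated with a minimal NumPy sketch. It is not the authors' evaluation code. It assumes single-label items and cosine similarity for neighbor search, and the function names (`cosine_distance`, `distance_summary`, `neighborhood_precision`) are hypothetical. The paper's exact definitions, for example how multi-label tags or tempo distances are handled, are not given in this section.

```python
import numpy as np


def cosine_distance(a, b):
    """Row-wise cosine distance between two (n, d) arrays of embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - np.sum(a * b, axis=1)


def distance_summary(original, augmented):
    """Mean and interquartile range (Q1, Q3) of the distance between each
    unmodified track embedding and its augmented counterpart (cf. Figure 2)."""
    d = cosine_distance(original, augmented)
    q1, q3 = np.percentile(d, [25, 75])
    return float(d.mean()), float(q1), float(q3)


def neighborhood_precision(embeddings, labels, k):
    """Average fraction of each item's k nearest neighbors that share its
    label (cf. Figure 3b/3c). Assumes one label per item, a simplification."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)          # exclude the query item itself
    nn = np.argsort(-sim, axis=1)[:, :k]    # indices of the k most similar items
    matches = labels[nn] == labels[:, None]
    return float(matches.mean())


# Toy usage: two well-separated clusters with matching labels.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
lab = np.array([0, 0, 1, 1])
p_at_1 = neighborhood_precision(emb, lab, k=1)
p_at_2 = neighborhood_precision(emb, lab, k=2)
```

In this toy case every item's single nearest neighbor shares its label, while widening the neighborhood to `k=2` necessarily pulls in one item from the other cluster. Under these assumptions, higher precision at a given `k` corresponds to stronger locality of that attribute in the embedding space.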