Leave-One-EquiVariant: Alleviating invariance-related information loss in contrastive music representations

Julien Guinot, Elio Quinton, György Fazekas

TL;DR

The paper addresses information loss due to invariances learned by contrastive music representations and proposes a task-adaptive framework, LOEV, that preserves information about chosen augmentations while maintaining general representation quality. LOEV introduces K variant subspaces in which each subspace is invariant to all transformations except one, with a global embedding space V that must encode all transformations; a corresponding LOEV loss combines base and variant objectives. A LOEV++ variant adds a disentangled latent space to enable targeted retrieval and attribute control over augmentations, improving retrieval and augmentation-related task performance without sacrificing overall representation quality. Across MTG-Jamendo pretraining and diverse MIR tasks, LOEV and LOEV++ demonstrate robust performance gains on augmentation-sensitive tasks and retrieval, illustrating practical benefits for task-specific MIR pipelines.

Abstract

Contrastive learning has proven effective in self-supervised musical representation learning, particularly for Music Information Retrieval (MIR) tasks. However, reliance on augmentation chains for contrastive view generation and the resulting learned invariances pose challenges when different downstream tasks require sensitivity to certain musical attributes. To address this, we propose the Leave One EquiVariant (LOEV) framework, which offers a more flexible, task-adaptive approach than previous work by selectively preserving information about specific augmentations, allowing the model to maintain task-relevant equivariances. We demonstrate that LOEV alleviates information loss related to learned invariances, improving performance on augmentation-related tasks and retrieval without sacrificing general representation quality. Furthermore, we introduce a variant of LOEV, LOEV++, which builds a disentangled latent space by design in a self-supervised manner and enables targeted retrieval based on augmentation-related attributes.
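The structure described above (one base objective plus K leave-one-out variant objectives) can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: it assumes an InfoNCE-style contrastive term, and the names `info_nce`, `loev_style_loss`, `temperature`, and `variant_weight`, as well as the choice to average the variant terms, are hypothetical and do not come from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE term for one query: -log softmax score of the positive."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[0]


def loev_style_loss(base_triplet, variant_triplets, variant_weight=1.0):
    """Combine a base contrastive term with K variant-subspace terms.

    Each triplet is (query, positive, negatives), already projected into
    the relevant space. How positives are built for subspace k (so that it
    stays sensitive to T_k only) is assumed to happen upstream.
    Averaging and weighting the variant terms is an assumption.
    """
    base = info_nce(*base_triplet)
    variants = [info_nce(*triplet) for triplet in variant_triplets]
    return base + variant_weight * sum(variants) / len(variants)
```

In this sketch, each entry of `variant_triplets` would correspond to one subspace, with its positive pair chosen so that only the left-out transformation distinguishes positives from negatives; that pairing rule is where the per-augmentation sensitivity would come from.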

Paper Structure

This paper contains 18 sections, 3 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Leave One EquiVariant framework. Subspace $\mathcal{Z}^k$ is invariant to all transformations except $T_k$, forcing the embedding superspace $\mathcal{V}$ to conserve information about all transformations.
  • Figure 2: Retrieval metrics for retrieved tags (MagnaTagATune), key (Giantsteps), and tempo (AllTempo). Precision@K, weighted accuracy, and $acc_1$ are computed between the seed track embedding and the retrieved labels for the $k \in \{1, 3, 5, 10\}$ nearest neighbouring embeddings.
  • Figure 3: Cosine distance between embeddings of pitch-shifted and non-pitch-shifted audio snippets for LOEV and MULE (left) and LOEV++ (right), for different subspaces and pretraining configurations.
  • Figure 4: LOEV(++) architectures. In either case, the probing representation can come from the embedding superspace, or, for LOEV++, the concatenation of the parallel superspaces can be used, as in Xiao et al. (2021).
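The kind of probe summarized in Figure 3 (cosine distance between the embedding of a snippet and the embedding of its pitch-shifted copy) could be computed along the following lines. This is a sketch under stated assumptions: `invariance_profile` is a hypothetical helper, and the toy `embed` and `pitch_shift` functions are stand-ins chosen only to make the example runnable; they are not real audio operations or the models from the paper.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the two embeddings point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def invariance_profile(embed, clip, shifts, pitch_shift):
    """Distance between embed(clip) and embed(shifted clip) for each shift."""
    reference = embed(clip)
    return {
        shift: cosine_distance(reference, embed(pitch_shift(clip, shift)))
        for shift in shifts
    }


# Toy stand-ins (hypothetical): a "clip" is a list of floats and a
# "pitch shift" is a circular rotation of that list.
def toy_pitch_shift(clip, steps):
    k = steps % len(clip)
    return clip[-k:] + clip[:-k]


def toy_invariant_embed(clip):
    return sorted(clip)  # discards ordering, so rotations are invisible


def toy_sensitive_embed(clip):
    return list(clip)  # keeps ordering, so rotations change the embedding


clip = [1.0, 0.0, 0.0, 0.0]
print(invariance_profile(toy_invariant_embed, clip, [0, 1, 2], toy_pitch_shift))
print(invariance_profile(toy_sensitive_embed, clip, [0, 1, 2], toy_pitch_shift))
```

A flat, near-zero profile indicates a space that has become invariant to the transformation, while a profile that grows with the shift indicates that information about the transformation has been kept.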