Leave-One-EquiVariant: Alleviating invariance-related information loss in contrastive music representations
Julien Guinot, Elio Quinton, György Fazekas
TL;DR
The paper addresses the information loss caused by invariances that contrastive music representations learn from their augmentations. It proposes LOEV, a task-adaptive framework that preserves information about chosen augmentations while maintaining general representation quality. LOEV introduces K variant subspaces, each invariant to all transformations except one, alongside a global embedding space V that must encode all transformations; the LOEV loss combines the base and variant objectives. The LOEV++ variant adds a disentangled latent space that enables targeted retrieval and attribute control over augmentations, improving retrieval and augmentation-related task performance without sacrificing overall representation quality. With pretraining on MTG-Jamendo and evaluation on diverse MIR tasks, LOEV and LOEV++ improve performance on augmentation-sensitive tasks and on retrieval, which illustrates their practical benefit for task-specific MIR pipelines.
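As a rough illustration of how a base objective and K per-subspace variant objectives might be combined, the following is a minimal sketch and not the authors' implementation. It assumes an InfoNCE-style contrastive loss, a simple average over the variant terms, and a scalar weight; the names `info_nce`, `combined_loss`, `temperature`, and `variant_weight` are hypothetical and do not come from the paper.

```python
import math


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor (assumed loss form, not from the paper).

    All vectors are plain lists of floats and are assumed to be non-zero.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # The positive pair sits at index 0, followed by the negatives.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]

    # Numerically stable log-sum-exp, then negative log-softmax of the positive.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[0]


def combined_loss(base_loss, variant_losses, variant_weight=1.0):
    """Base loss plus the weighted mean of the K variant-subspace losses.

    Averaging and the single weight are assumptions made for this sketch.
    """
    mean_variant = sum(variant_losses) / len(variant_losses)
    return base_loss + variant_weight * mean_variant


# Toy usage with K = 2 variant subspaces and 2-dimensional embeddings.
# In each variant subspace, which pairs count as positives would depend on
# the transformation that subspace stays sensitive to (illustrative values).
base = info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
variants = [
    info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]),
    info_nce([0.0, 1.0], [0.1, 0.9], [[1.0, 0.0]]),
]
total = combined_loss(base, variants, variant_weight=0.5)
```

In this sketch the subspaces differ only in which pairs are treated as positives and negatives; how the paper constructs those pairs and weights the terms is not specified in this section.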
Abstract
Contrastive learning has proven effective in self-supervised musical representation learning, particularly for Music Information Retrieval (MIR) tasks. However, reliance on augmentation chains for contrastive view generation, and the invariances learned as a result, pose challenges when different downstream tasks require sensitivity to certain musical attributes. To address this, we propose the Leave-One-EquiVariant (LOEV) framework, a more flexible, task-adaptive approach than previous work that selectively preserves information about specific augmentations, allowing the model to maintain task-relevant equivariances. We demonstrate that LOEV alleviates information loss related to learned invariances, improving performance on augmentation-related tasks and retrieval without sacrificing general representation quality. Furthermore, we introduce a variant of LOEV, LOEV++, which builds a disentangled latent space by design in a self-supervised manner and enables targeted retrieval based on augmentation-related attributes.
