Zero-shot Musical Stem Retrieval with Joint-Embedding Predictive Architectures

Alain Riou, Antonin Gagneré, Gaëtan Hadjeres, Stefan Lattner, Geoffroy Peeters

TL;DR

This work addresses zero-shot musical stem retrieval by extending Joint-Embedding Predictive Architectures to arbitrary instruments. It introduces CLAP-based FiLM conditioning for flexible, text-driven instrument conditioning and adds a contrastive pretraining phase to improve latent representations. A two-phase training scheme optimizes an encoder and a predictor to recover target stem representations from the context. Evaluations on MUSDB18 and MoisesDB show that CLAP+FiLM conditioning substantially improves retrieval, with competitive performance across different conditioning granularities. The learned representations also retain temporal information, as demonstrated on a beat-tracking task. Overall, the method advances practical zero-shot stem retrieval and suggests broader benefits for music representation learning, with potential applicability beyond the MIR domain.

Abstract

In this paper, we tackle the task of musical stem retrieval: given a musical mix, the goal is to retrieve a stem that would fit with it, i.e., that would sound pleasant if played together. To do so, we introduce a new method based on Joint-Embedding Predictive Architectures, where an encoder and a predictor are jointly trained to produce latent representations of a context and predict latent representations of a target. In particular, we design our predictor to be conditioned on arbitrary instruments, enabling our model to perform zero-shot stem retrieval. In addition, we find that pretraining the encoder using contrastive learning drastically improves the model's performance. We validate the retrieval performance of our model on the MUSDB18 and MoisesDB datasets. We show that it significantly outperforms previous baselines on both datasets, showcasing its ability to support more or less precise (and possibly unseen) conditioning. We also evaluate the learned embeddings on a beat tracking task, demonstrating that they retain temporal structure and local information.
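
To make the retrieval setting concrete, the following is a minimal sketch of how a predicted target embedding could be used to rank candidate stems. It assumes that retrieval is done by nearest-neighbor search with cosine similarity, which the abstract does not state; the function names `cosine_similarity` and `rank_stems`, the stem names, and the toy 3-dimensional vectors are all hypothetical and stand in for real encoder and predictor outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_stems(predicted, candidates):
    """Return candidate names sorted from best to worst match.

    `predicted` stands in for the predictor's output given a context mix;
    `candidates` maps a stem name to its (hypothetical) embedding.
    """
    scores = {
        name: cosine_similarity(predicted, embedding)
        for name, embedding in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: three candidate stems and one predicted target embedding.
predicted = [0.9, 0.1, 0.0]
candidates = {
    "bass_a": [1.0, 0.0, 0.0],
    "bass_b": [0.0, 1.0, 0.0],
    "bass_c": [0.5, 0.5, 0.0],
}
ranking = rank_stems(predicted, candidates)
print(ranking)
```

In this toy setup the candidate most aligned with the predicted embedding is ranked first. A real system would compare against embeddings of actual stems and would typically report metrics over the resulting ranking.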

Paper Structure

This paper contains 21 sections, 5 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of our model. In both phases, it is trained using a pair composed of a context mix $\mathbf{x}$ and a target stem $\mathbf{\bar{x}}$ extracted from the same track. In Phase 1 (left), we pretrain only the encoder $f_{\theta}$ using contrastive learning to bring together the (averaged) representations $\mathbf{s}$ and $\mathbf{\bar{s}}$ of the context and target. In Phase 2 (right), a predictor $g_{\phi}$, conditioned on the instrument $c$ of the target stem, tries to retrieve the patchwise representations $\mathbf{\bar{z}}$ of the target stem from those of the context mix, $\mathbf{z}$. In this phase, the parameters of the target encoder $f_{\bar{\theta}}$ are updated as an exponential moving average (EMA) of those of the encoder $f_{\theta}$.
  • Figure 2: Analysis of the proportions of stems correctly retrieved by our model (with fine conditioning) on the MoisesDB dataset. The retrieved stem may or may not come from the right audio track, and it may be the right instrument, another instrument from the same category (e.g., "acoustic guitar" vs. "electric guitar"), or wrong on both counts. The labels are the ones from MoisesDB and are colored according to their presence in the training set: green if they are in it, orange if they are in it but written differently (e.g., "female lead vocals" vs. "lead female singer"), and red if they are not (zero-shot scenario).
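
Two mechanisms from Phase 2 in Figure 1 can be sketched in a few lines: the EMA update of the target encoder parameters, and FiLM-style conditioning as named in the TL;DR. This is only an illustration on plain Python lists, not the authors' implementation. The decay value, the function names `ema_update` and `film`, and the idea that `gamma` and `beta` are already computed from the instrument embedding are assumptions made for the example.

```python
def ema_update(target_params, online_params, decay=0.99):
    """Move each target parameter toward the online one.

    Implements target <- decay * target + (1 - decay) * online.
    The default decay of 0.99 is an illustrative assumption.
    """
    return [
        decay * t + (1.0 - decay) * o
        for t, o in zip(target_params, online_params)
    ]


def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each feature.

    In a FiLM layer, gamma and beta are normally produced from the
    conditioning signal (here, hypothetically, an instrument embedding)
    by a small learned network; they are passed in directly here.
    """
    return [g * h + b for h, g, b in zip(features, gamma, beta)]


# Toy parameters: the target encoder drifts slowly toward the encoder.
target = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)

# Toy features modulated by per-feature scale and shift.
modulated = film([1.0, 2.0], gamma=[2.0, 0.5], beta=[0.0, 1.0])
print(target, modulated)
```

A decay close to 1 makes the target encoder change slowly, which is the usual motivation for EMA targets in joint-embedding setups. Changing `gamma` and `beta` changes the predictor's output for the same context, which is how a single predictor can serve many instruments.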