Zero-shot Musical Stem Retrieval with Joint-Embedding Predictive Architectures
Alain Riou, Antonin Gagneré, Gaëtan Hadjeres, Stefan Lattner, Geoffroy Peeters
TL;DR
This work addresses zero-shot musical stem retrieval by extending Joint-Embedding Predictive Architectures to arbitrary instruments. It introduces CLAP-based FiLM conditioning, which enables flexible, text-driven instrument conditioning, and adds a contrastive pretraining phase to improve the latent representations. In the resulting two-phase training scheme, an encoder and a predictor are optimized to recover the representation of a target stem from its context. Evaluations on MUSDB18 and MoisesDB show that CLAP+FiLM conditioning substantially improves retrieval and that performance remains competitive across conditioning granularities. The learned representations are robust and retain temporal information, as demonstrated on beat tracking. Overall, the method advances practical zero-shot stem retrieval and suggests broader benefits for music representation learning, with potential applicability beyond the MIR domain.
Abstract
In this paper, we tackle the task of musical stem retrieval: given a musical mix, the goal is to retrieve a stem that would fit with it, i.e., that would sound pleasant if played together. To do so, we introduce a new method based on Joint-Embedding Predictive Architectures, where an encoder and a predictor are jointly trained to produce latent representations of a context and predict latent representations of a target. In particular, we design our predictor to be conditioned on arbitrary instruments, enabling our model to perform zero-shot stem retrieval. In addition, we find that pretraining the encoder using contrastive learning drastically improves the model's performance. We validate the retrieval performance of our model on the MUSDB18 and MoisesDB datasets. We show that it significantly outperforms previous baselines on both datasets and that it supports more or less precise (and possibly unseen) conditioning. We also evaluate the learned embeddings on a beat tracking task, demonstrating that they retain temporal structure and local information.
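To make the retrieval setup more concrete, here is a minimal sketch in Python (NumPy), not the authors' implementation. It assumes that the predictor can be reduced to a single FiLM-style scale-and-shift of the context embedding, that the conditioning vector stands in for a CLAP text embedding, and that candidate stems are ranked by cosine similarity in the latent space. All names (`film`, `cosine_rank`, the random weights and embeddings) are hypothetical and chosen for illustration only.

```python
import numpy as np

def film(h, cond, w_gamma, w_beta):
    """FiLM-style modulation: scale and shift h with parameters computed from cond."""
    gamma = cond @ w_gamma  # per-feature scale, shape (d,)
    beta = cond @ w_beta    # per-feature shift, shape (d,)
    return gamma * h + beta

def cosine_rank(query, candidates):
    """Return candidate indices sorted by decreasing cosine similarity to query."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argsort(-(c @ q))

rng = np.random.default_rng(0)
d, c_dim = 8, 4

# Assumption: random vectors stand in for the encoder output of the mix
# and for a text embedding of the requested instrument (e.g. "bass").
context_emb = rng.normal(size=d)
text_emb = rng.normal(size=c_dim)

# Assumption: random (untrained) conditioning weights.
w_gamma = rng.normal(size=(c_dim, d))
w_beta = rng.normal(size=(c_dim, d))

# Predicted latent representation of the missing stem.
predicted = film(context_emb, text_emb, w_gamma, w_beta)

# Toy candidate pool: candidate 3 is placed close to the prediction.
candidates = rng.normal(size=(5, d))
candidates[3] = predicted + 0.01 * rng.normal(size=d)

ranking = cosine_rank(predicted, candidates)
print("best candidate index:", ranking[0])
```

In this toy setting the planted candidate is ranked first. With a trained model, the candidate embeddings would come from encoding real stems, and changing the text embedding would change which stems rank highest for the same mix.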
