Investigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learning
Alain Riou, Stefan Lattner, Gaëtan Hadjeres, Geoffroy Peeters
TL;DR
This work tackles self-supervised general audio representation learning using Joint-Embedding Predictive Architectures (JEPA), which learn to predict the representations of masked target patches from the visible context patches of a log-mel spectrogram. It formalizes training with a context encoder, a predictor, and a target encoder maintained as an exponential moving average of the context encoder, and optimizes a smoothed $L_1$ loss ${\mathcal{L}}$ to align predicted and target embeddings. Through extensive experiments with AudioSet pretraining and linear evaluation on eight downstream tasks, it shows that masking strategy and temporal duration have strong, modality-specific effects: unstructured masking generally outperforms multi-block and time-based masking for mel-spectrograms, longer context helps some tasks while hurting others, and target-domain masking in latent space often degrades performance. The findings reveal notable differences between the audio and image domains, highlight the effectiveness of Vision-Transformer-based encoders for audio, and provide actionable guidance for designing general-purpose audio SSL systems. Overall, the work clarifies how to tailor JEPA-style self-supervised learning to audio data and multi-domain downstream tasks.
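The two training ingredients named above, the smoothed $L_1$ loss and the exponential-moving-average update of the target encoder, can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the threshold `beta`, and the decay rate `tau` are assumptions chosen for the example, and embeddings and weights are represented as flat lists of floats.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean smoothed L1 loss between two equal-length vectors.

    `beta` (assumed value) is the error size at which the loss
    switches from quadratic to linear.
    """
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta  # quadratic near zero
        else:
            total += diff - 0.5 * beta         # linear for large errors
    return total / len(pred)


def ema_update(target_params, context_params, tau=0.99):
    """Move target-encoder weights toward the context-encoder weights.

    `tau` (assumed value) is the decay rate; no gradient flows
    through the target encoder, it is only updated by this rule.
    """
    return [tau * t + (1.0 - tau) * c
            for t, c in zip(target_params, context_params)]
```

In a training loop of this kind, `smooth_l1` would be applied to the predictor output and the target-encoder output for the masked patches, and `ema_update` would be called after each optimizer step on the context encoder and predictor.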
Abstract
This paper addresses the problem of self-supervised general-purpose audio representation learning. We explore the use of Joint-Embedding Predictive Architectures (JEPA) for this task, which consists of splitting an input mel-spectrogram into two parts (context and target), computing neural representations for each, and training the neural network to predict the target representations from the context representations. We investigate several design choices within this framework and study their influence through extensive experiments by evaluating our models on various audio classification benchmarks, including environmental sounds, speech and music downstream tasks. We focus notably on which part of the input data is used as context or target and show experimentally that it significantly impacts the model's quality. In particular, we notice that some effective design choices in the image domain lead to poor performance on audio, thus highlighting major differences between these two modalities.
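As a rough illustration of the context/target split, the sketch below draws an unstructured (uniformly random) partition of a spectrogram patch grid. It assumes, for simplicity, that every patch is either context or target; the function name `split_patches`, the grid size, and `target_ratio` are hypothetical and not taken from the paper.

```python
import random


def split_patches(n_freq, n_time, target_ratio=0.5, seed=None):
    """Randomly partition an n_freq x n_time patch grid into context and target.

    Returns two sorted lists of (freq_index, time_index) pairs.
    The split is unstructured: each patch is assigned independently
    of its neighbours, with no block or time-contiguity constraint.
    """
    rng = random.Random(seed)
    indices = [(f, t) for f in range(n_freq) for t in range(n_time)]
    rng.shuffle(indices)
    n_target = int(round(target_ratio * len(indices)))
    target = sorted(indices[:n_target])
    context = sorted(indices[n_target:])
    return context, target


# Example: a 4 x 8 grid where a quarter of the patches become targets.
context, target = split_patches(4, 8, target_ratio=0.25, seed=0)
```

The context encoder would see only the `context` patches, and the predictor would be asked to produce the target-encoder representations at the `target` positions. Block-wise or time-based strategies would replace the shuffle with a structured selection of indices.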
