Investigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learning

Alain Riou, Stefan Lattner, Gaëtan Hadjeres, Geoffroy Peeters

TL;DR

This work tackles self-supervised general-purpose audio representation learning with Joint-Embedding Predictive Architectures (JEPA): the model learns to predict the representations of target patches from masked context patches of a log-mel spectrogram. Training involves a context encoder, a predictor, and a target encoder updated as an exponential moving average of the context encoder, and it minimizes a smooth $L_1$ loss $\mathcal{L}$ between predicted and target embeddings. Through extensive experiments with AudioSet pretraining and linear evaluation on eight downstream tasks, the paper shows that the masking strategy and the temporal duration of the input have strong, modality-specific effects: unstructured masking generally outperforms multi-block and time-based masking on mel-spectrograms, longer context helps some tasks while hurting others, and target-domain masking in latent space often degrades performance. The findings reveal notable differences between the audio and image domains, highlight the effectiveness of Vision-Transformer–based encoders for audio, and provide actionable guidance for designing general-purpose audio SSL systems.

Abstract

This paper addresses the problem of self-supervised general-purpose audio representation learning. We explore the use of Joint-Embedding Predictive Architectures (JEPA) for this task, which consists of splitting an input mel-spectrogram into two parts (context and target), computing neural representations for each, and training the neural network to predict the target representations from the context representations. We investigate several design choices within this framework and study their influence through extensive experiments by evaluating our models on various audio classification benchmarks, including environmental sounds, speech and music downstream tasks. We focus notably on which part of the input data is used as context or target and show experimentally that it significantly impacts the model's quality. In particular, we notice that some effective design choices in the image domain lead to poor performance on audio, thus highlighting major differences between these two modalities.
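
The two training ingredients named in the TL;DR, a smooth $L_1$ loss between predicted and target embeddings and an exponential-moving-average update of the target encoder, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the flat-list representation of embeddings and weights, and the values of `beta` and `tau` are assumptions made for illustration only.

```python
from typing import List


def smooth_l1(pred: List[float], target: List[float], beta: float = 1.0) -> float:
    """Mean smooth L1 distance between a predicted and a target embedding.

    `beta` (assumed value) is the threshold between the quadratic and
    linear regimes of the loss.
    """
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # Quadratic for small errors, linear for large ones.
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def ema_update(target_w: List[float], context_w: List[float], tau: float = 0.99) -> List[float]:
    """Move target-encoder weights toward the context-encoder weights.

    `tau` (assumed value) is the EMA decay rate; the target encoder
    receives no gradient and is only updated through this rule.
    """
    return [tau * t + (1.0 - tau) * c for t, c in zip(target_w, context_w)]


if __name__ == "__main__":
    print(smooth_l1([0.0, 0.0], [0.5, 2.0]))
    print(ema_update([1.0, 1.0], [0.0, 2.0], tau=0.9))
```

In a full training loop, the loss would be backpropagated through the predictor and the context encoder only, and `ema_update` would be applied to the target encoder after each optimizer step.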

Paper Structure

This paper contains 18 sections, 3 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Investigated masking strategies. For the Multi-block and Time strategies, target and context blocks are sampled independently; patches in the target blocks are then removed from the context block.
  • Figure 2: Overview of our framework. From an input mel-spectrogram, we first extract context and target patches via masking. These masked inputs are fed through their respective encoders to produce context and target representations. We then add positional embeddings to the context representations at the target's positions and pass the constructed sequence through a predictor, whose patch-level outputs are finally compared to the target representations.
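
The masking step described in the two captions can be sketched as follows, for a spectrogram already divided into a grid of patches. This is a hedged illustration rather than the paper's code: the function names, the grid sizes, and the ratios are hypothetical. It also assumes that unstructured masking splits the grid into complementary random sets, and that Time blocks span the full frequency axis. The removal of target patches from the context follows the Figure 1 caption.

```python
import random
from typing import Set, Tuple

Patch = Tuple[int, int]  # (frequency index, time index) of a patch in the grid


def unstructured_mask(n_freq: int, n_time: int, target_ratio: float,
                      rng: random.Random) -> Tuple[Set[Patch], Set[Patch]]:
    """Pick target patches uniformly at random; the rest form the context."""
    patches = [(f, t) for f in range(n_freq) for t in range(n_time)]
    rng.shuffle(patches)
    k = int(len(patches) * target_ratio)
    return set(patches[k:]), set(patches[:k])  # (context, target)


def time_mask(n_freq: int, n_time: int, target_len: int, context_len: int,
              rng: random.Random) -> Tuple[Set[Patch], Set[Patch]]:
    """Sample a context and a target time block independently, then remove
    the target patches from the context (the context may end up empty if
    the context block lies entirely inside the target block)."""
    t0 = rng.randrange(n_time - target_len + 1)
    c0 = rng.randrange(n_time - context_len + 1)
    target = {(f, t) for f in range(n_freq) for t in range(t0, t0 + target_len)}
    context = {(f, t) for f in range(n_freq) for t in range(c0, c0 + context_len)}
    return context - target, target


if __name__ == "__main__":
    rng = random.Random(0)
    ctx, tgt = unstructured_mask(4, 8, 0.25, rng)
    print(len(ctx), len(tgt))
    ctx, tgt = time_mask(4, 8, target_len=2, context_len=5, rng=rng)
    print(len(ctx), len(tgt))
```

In both cases the context and target sets are disjoint, so the predictor never sees the patches whose representations it has to predict.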