A-JEPA: Joint-Embedding Predictive Architecture Can Listen

Zhengcong Fei, Mingyuan Fan, Junshi Huang

TL;DR

A-JEPA extends the Joint-Embedding Predictive Architecture to audio by predicting latent-space representations of masked spectrogram regions from visible context. It introduces a curriculum masking strategy that evolves from random blocks to time-frequency aware patterns and uses an EMA-updated target encoder with a multi-mask objective; fine-tuning employs regularized masking to improve robustness. Empirically, A-JEPA achieves state-of-the-art results on AudioSet-2M and AS-20K and outperforms audio pre-training baselines without relying on external non-audio data, while scaling with more data and longer pre-training. The work demonstrates that latent-space predictive pre-training is a scalable and effective paradigm for audio foundation models, with potential for future multi-modal extensions.

Abstract

This paper shows that the masked-modeling principle driving the success of large foundational vision models can be effectively applied to audio by making predictions in a latent space. We introduce the Audio-based Joint-Embedding Predictive Architecture (A-JEPA), a simple extension for self-supervised learning from audio spectrograms. Following the design of I-JEPA, A-JEPA uses a context encoder to encode the visible audio spectrogram patches, which are selected with a curriculum masking strategy, and predicts the representations of regions sampled at well-designed locations. The target representations of those regions are extracted from the whole spectrogram by the target encoder, i.e., the exponential moving average of the context encoder. We find it beneficial to move from random block masking to time-frequency aware masking in a curriculum manner, given that audio spectrograms are highly correlated in local time and frequency. To enhance contextual semantic understanding and robustness, we fine-tune the encoder with regularized masking on target datasets, instead of dropping inputs or setting them to zero. Empirically, when built on the Vision Transformer structure, A-JEPA is highly scalable and sets new state-of-the-art performance on multiple audio and speech classification tasks, outperforming other recent models that use externally supervised pre-training.
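
The abstract describes the target encoder as an exponential moving average (EMA) of the context encoder. A minimal NumPy sketch of such an update is given below. The function name `ema_update`, the dictionary-of-arrays parameter layout, and the momentum values are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def ema_update(target_params, context_params, momentum=0.996):
    """Update target parameters in place: target <- m * target + (1 - m) * context.

    `momentum` is a hypothetical default; the source does not state a value.
    """
    for name, ctx in context_params.items():
        target_params[name] *= momentum
        target_params[name] += (1.0 - momentum) * ctx

# Toy parameters standing in for encoder weights (hypothetical shapes).
context = {"w": np.ones((2, 2)), "b": np.zeros(2)}
target = {"w": np.zeros((2, 2)), "b": np.zeros(2)}

# One EMA step with momentum 0.9 moves each target weight 10% toward the context.
ema_update(target, context, momentum=0.9)
```

In a setup like this, only the context encoder would receive gradients, and the target encoder would follow it through repeated calls to the update after each optimizer step.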

Paper Structure (24 sections, 1 equation, 8 figures, 6 tables, 1 algorithm)

This paper contains 24 sections, 1 equation, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of A-JEPA. The audio-based joint-embedding predictive architecture uses a context encoder to predict and align the representations of various target blocks in a latent space, originating from the same audio spectrogram.
  • Figure 2: Examples of our context- and target-masking strategies on Mel-spectrograms. Given an audio spectrogram, (i) random block: we randomly sample 4 target blocks with scale in the range (0.15, 0.2) and aspect ratio in the range (0.75, 1.5); (ii) time and frequency: we randomly sample 3 target blocks with scale in the range (0.05, 0.075) and remove all the related time or frequency bins across the whole Mel-spectrogram. Next, we randomly sample a context block with scale in the range (0.85, 1.0) and remove any patches that overlap the target blocks. Green patches are selected, while gray patches are removed.
  • Figure 3: Different curriculum functions. As the pre-training step increases, each progressive function moves from easy (0) toward hard (1) along a different trend.
  • Figure 4: Regularized patch masking during fine-tuning. A masked patch is forbidden from taking part in the attention computation, so its attention score is contributed entirely by other patches. The masking manipulates the connections between patch tokens in self-attention, forcing the network to exploit partial neighbors' information to produce a meaningful representation.
  • Figure 5: Visualization of A-JEPA predictor representations. The first column shows the original audio spectrogram, and the second column shows the context audio spectrogram, which is processed with a pre-trained A-JEPA ViT-B encoder. Red bounding boxes in the subsequent columns mark samples produced by a generative model that decodes the output of the pre-trained A-JEPA predictor, conditioned on positional mask tokens corresponding to the location of the bounding box. Qualities shared among these samples indicate the information contained in the A-JEPA prediction.
  • ...and 3 more figures
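
The masking procedure in the Figure 2 caption can be sketched on a grid of spectrogram patches as follows. This is a rough illustration under stated assumptions rather than the authors' implementation. The block counts and the scale and aspect-ratio ranges come from the caption. The function names, the choice of rows as frequency and columns as time, the reuse of the (0.75, 1.5) aspect ratio in the time-frequency mode, applying the aspect ratio relative to the grid shape, and the 50/50 choice between time and frequency stripes are all hypothetical.

```python
import numpy as np

def sample_block(rng, grid_h, grid_w, scale_range, aspect_range):
    """Return a boolean (grid_h, grid_w) mask containing one rectangular block."""
    scale = rng.uniform(*scale_range)    # fraction of the grid area to cover
    aspect = rng.uniform(*aspect_range)  # applied relative to the grid shape (assumption)
    h = min(max(int(round(grid_h * np.sqrt(scale * aspect))), 1), grid_h)
    w = min(max(int(round(grid_w * np.sqrt(scale / aspect))), 1), grid_w)
    top = rng.integers(0, grid_h - h + 1)
    left = rng.integers(0, grid_w - w + 1)
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    mask[top:top + h, left:left + w] = True
    return mask

def sample_masks(rng, grid_h, grid_w, time_freq=False):
    """Sample one context mask and a list of target masks for one spectrogram."""
    targets = []
    if time_freq:
        # (ii) 3 small blocks, each extended to full frequency bands or time spans.
        for _ in range(3):
            block = sample_block(rng, grid_h, grid_w, (0.05, 0.075), (0.75, 1.5))
            stripe = np.zeros_like(block)
            if rng.random() < 0.5:
                stripe[block.any(axis=1), :] = True  # rows = frequency bands (assumption)
            else:
                stripe[:, block.any(axis=0)] = True  # columns = time frames (assumption)
            targets.append(stripe)
    else:
        # (i) 4 random rectangular target blocks.
        for _ in range(4):
            targets.append(sample_block(rng, grid_h, grid_w, (0.15, 0.2), (0.75, 1.5)))
    # Context block with scale in (0.85, 1.0), minus every patch used as a target.
    context = sample_block(rng, grid_h, grid_w, (0.85, 1.0), (1.0, 1.0))
    for t in targets:
        context &= ~t
    return context, targets

rng = np.random.default_rng(0)
step, total_steps = 300, 1000
# A linear schedule is only one possible curriculum function (see Figure 3).
use_time_freq = rng.random() < step / total_steps
context_mask, target_masks = sample_masks(rng, 8, 64, time_freq=use_time_freq)
```

The 8 x 64 grid is a placeholder for the patch layout of a Mel-spectrogram. The context mask and the target masks are disjoint by construction, which matches the caption's removal of overlapping regions from the context block.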