A-JEPA: Joint-Embedding Predictive Architecture Can Listen
Zhengcong Fei, Mingyuan Fan, Junshi Huang
TL;DR
A-JEPA extends the Joint-Embedding Predictive Architecture to audio by predicting latent-space representations of masked spectrogram regions from visible context. It introduces a curriculum masking strategy that evolves from random blocks to time-frequency aware patterns and uses an EMA-updated target encoder with a multi-mask objective; fine-tuning employs regularized masking to improve robustness. Empirically, A-JEPA achieves state-of-the-art results on AudioSet-2M and AS-20K and outperforms audio pre-training baselines without relying on external non-audio data, while scaling with more data and longer pre-training. The work demonstrates that latent-space predictive pre-training is a scalable and effective paradigm for audio foundation models, with potential for future multi-modal extensions.
Abstract
This paper shows that the masked-modeling principle driving the success of large foundational vision models can be effectively applied to audio by making predictions in a latent space. We introduce Audio-based Joint-Embedding Predictive Architecture (A-JEPA), a simple extension for self-supervised learning from the audio spectrum. Following the design of I-JEPA, A-JEPA uses a context encoder to encode the audio spectrogram patches left visible by a curriculum masking strategy, and predicts the representations of regions sampled at well-designed locations. The target representations of those regions are extracted from the whole spectrogram by the target encoder, i.e., an exponential moving average of the context encoder. Because audio spectrograms are highly correlated in local time and frequency, we find it beneficial to move from random block masking to time-frequency aware masking in a curriculum manner. To enhance contextual semantic understanding and robustness, we fine-tune the encoder on target datasets with regularized masking, instead of dropping inputs or setting them to zero. Empirically, when built on the Vision Transformer structure, A-JEPA is highly scalable and sets new state-of-the-art performance on multiple audio and speech classification tasks, outperforming other recent models that use externally supervised pre-training.
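As a rough illustration of two mechanisms named in the abstract, the sketch below shows an exponential-moving-average update of target-encoder parameters and a curriculum that shifts from random block masks to time-frequency stripe masks over training. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the momentum value of 0.996, the block and stripe sizes, and the linear schedule are all hypothetical choices that the abstract does not specify.

```python
import random


def ema_update(target, context, momentum=0.996):
    """Move each target parameter toward the matching context parameter.

    The momentum value is an assumed default, not taken from the abstract.
    """
    return [momentum * t + (1.0 - momentum) * c for t, c in zip(target, context)]


def random_block_mask(n_time, n_freq, block_t, block_f, rng):
    """Mask one rectangular block of patches at a random grid position."""
    t0 = rng.randrange(n_time - block_t + 1)
    f0 = rng.randrange(n_freq - block_f + 1)
    return {(t, f)
            for t in range(t0, t0 + block_t)
            for f in range(f0, f0 + block_f)}


def time_freq_mask(n_time, n_freq, band_t, band_f, rng):
    """Mask a full-height time stripe plus a full-width frequency stripe."""
    t0 = rng.randrange(n_time - band_t + 1)
    f0 = rng.randrange(n_freq - band_f + 1)
    time_stripe = {(t, f) for t in range(t0, t0 + band_t) for f in range(n_freq)}
    freq_stripe = {(t, f) for t in range(n_time) for f in range(f0, f0 + band_f)}
    return time_stripe | freq_stripe


def curriculum_mask(step, total_steps, n_time, n_freq, rng):
    """Choose the time-frequency mask with probability step / total_steps.

    The linear schedule and the fixed sizes (3x3 block, width-2 stripes)
    are assumptions made only for this illustration.
    """
    p_time_freq = step / total_steps
    if rng.random() < p_time_freq:
        return time_freq_mask(n_time, n_freq, 2, 2, rng)
    return random_block_mask(n_time, n_freq, 3, 3, rng)
```

Calling `curriculum_mask(step, total_steps, 8, 4, random.Random(0))` returns a set of `(time, frequency)` patch indices to hide from the context encoder; early steps yield compact blocks, while late steps yield stripes that span the whole time or frequency axis.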
