WavJEPA: Semantic learning unlocks robust audio foundation models for raw waveforms

Goksenin Yuksel, Pierre Guetschel, Michael Tangermann, Marcel van Gerven, Kiki van der Heijden

TL;DR

WavJEPA introduces semantic learning to general-purpose audio representation learning directly from raw waveforms, avoiding the latency of spectrogram computation and the loss of phase information while outperforming prior time-domain models on HEAR and ARCH with substantially lower compute. The core framework predicts latent target representations from contextual waveform segments using an EMA-updated target encoder, enabling robust, high-level semantic understanding of sound. A multi-channel variant, WavJEPA-Nat, extends the approach to spatialized naturalistic scenes, improving resilience to noise and reverberation. Together, these approaches demonstrate feasible, efficient time-domain foundation models with strong transfer to real-world acoustic environments and potential for low-latency audio generation tasks.

Abstract

Learning audio representations from raw waveforms overcomes key limitations of spectrogram-based audio representation learning, such as the long latency of spectrogram computation and the loss of phase information. Yet, while self-supervised speech representation learning from raw waveforms has been remarkably successful, these approaches have not achieved similar feats for general-purpose audio representation learning from waveforms. Here, we propose WavJEPA, a waveform-based version of the Joint-Embedding Predictive Architecture. WavJEPA leverages high-level semantic representation learning to tackle the shortcomings of representation learning at the speech unit or token level. We show that this approach substantially outperforms state-of-the-art time-domain audio foundation models across a wide variety of downstream benchmark tasks, while requiring considerably fewer computational resources. Additionally, to overcome the performance drop that time-domain models typically exhibit in noisy and reverberant real-world acoustic environments, we present WavJEPA-Nat. WavJEPA-Nat is a multi-channel extension of the WavJEPA architecture trained on simulated naturalistic scenes. We find that WavJEPA-Nat is highly robust to reverberation and noise. These results highlight the feasibility and computational efficiency of general-purpose audio representation learning from raw waveforms, showcasing the potential for low-latency, robust time-domain audio foundation models for real-world applications.

Paper Structure

This paper contains 20 sections, 1 equation, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Semantic representation learning from raw waveforms. WavJEPA predicts latent target representations at specific locations from a context representation. The weights of the target encoder are not trained but updated using the exponential moving average (EMA) of the weights of the context encoder.
  • Figure 2: Downstream task performance $s(m)$ vs. pre-training data (AudioSet). Symbols depict performance $s$ for HEAR (left panel) and for ARCH (right panel) as a function of the number of samples seen during pre-training. Symbol size reflects the number of model parameters. For WavJEPA, we depict performance after 50k, 100k, 200k, and 375k training steps.
  • Figure 3: Ablation studies. The left panel compares performance on HEAR and NatHEAR for the WavJEPA-Nat architecture as a function of the ratio ($\lambda$) between clean and naturalistic scenes in the pre-training data. The middle panel depicts the impact of the top-$K$ averaging parameter per HEAR task for WavJEPA. The right panel compares the impact of target length ($M_{target}$) per task. For ease of visualization, the middle and right panels include only HEAR tasks for which WavJEPA performed better than the baseline.
  • Figure 4: Robust representation learning from naturalistic sound scenes including noise and reverberation. WavJEPA-Nat is a multi-channel extension of WavJEPA which uses a dual waveform encoder to learn inter- and intra-channel characteristics and predicts 2D latent target representations from a 2D context block. The weights of the target encoder are not trained but updated using the exponential moving average (EMA) of the weights of the context encoder.
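
The captions for Figures 1 and 4 state that the target encoder is not trained directly; its weights follow an exponential moving average (EMA) of the context encoder's weights. The following is a minimal sketch of such an update, not the authors' implementation. The function name `ema_update`, the dictionary-of-arrays parameter layout, and the decay value are assumptions made for illustration; the text above does not specify the decay used.

```python
import numpy as np


def ema_update(target_params, context_params, decay=0.999):
    """Return updated target parameters.

    Each target tensor becomes decay * target + (1 - decay) * context.
    `decay` is a hypothetical value; the source does not state one.
    """
    return {
        name: decay * target_params[name] + (1.0 - decay) * context_params[name]
        for name in target_params
    }


# Toy parameters with hypothetical names, standing in for encoder weights.
context = {"w": np.ones(3), "b": np.zeros(1)}
target = {"w": np.zeros(3), "b": np.zeros(1)}

# One EMA step after a (simulated) gradient update of the context encoder.
target = ema_update(target, context, decay=0.9)
```

With a decay close to 1, the target encoder changes slowly, so the latent targets being predicted remain comparatively stable while the context encoder is optimized.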