Contrastive Audio-Visual Masked Autoencoder

Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

TL;DR

This work tackles scalable, fully self-supervised audio-visual understanding by introducing CAV-MAE, which fuses Contrastive Audio-Visual (CAV) learning with Masked Autoencoder (MAE) objectives. The model uses modality-specific encoders and a joint AV encoder, with multi-stream forward passes to maintain modality-specific contrastive signals while enabling cross-modal fusion through reconstruction. Empirically, CAV-MAE achieves state-of-the-art results on VGGSound and competitive AudioSet performance, while also excelling in audio-visual retrieval and improving single-modal downstream tasks. The findings demonstrate that pairing contrastive alignment with masked data modeling yields complementary benefits for robust, scalable audio-visual representation learning.

Abstract

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments show that the contrastive audio-visual correspondence learning objective not only enables the model to perform audio-visual retrieval tasks, but also helps the model learn a better joint representation. As a result, our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound, and is comparable with the previous best supervised pretrained model on AudioSet in the audio-visual event classification task. Code and pretrained models are at https://github.com/yuangongnd/cav-mae.
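
As a rough illustration of how a contrastive objective and a masked-reconstruction objective might be combined into one training loss, here is a minimal, dependency-free sketch. It is not the authors' implementation (see the linked repository for that). The function names, the InfoNCE-style form of the contrastive term, and the `temperature` and `contrast_weight` defaults are all illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (guarding against a zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(audio_emb, video_emb, temperature=0.05):
    """InfoNCE-style loss (an assumed form): row i of each list is a matched pair.

    For every video embedding, the matching audio embedding is the positive
    and all other audio embeddings in the batch are negatives.
    """
    audio = [l2_normalize(a) for a in audio_emb]
    video = [l2_normalize(v) for v in video_emb]
    n = len(audio)
    total = 0.0
    for i in range(n):
        logits = [
            sum(p * q for p, q in zip(video[i], audio[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / n


def masked_mse(pred_patches, target_patches, masked_idx):
    """Mean squared error computed over the masked patches only."""
    errors = []
    for i in masked_idx:
        errors.extend(
            (p - t) ** 2 for p, t in zip(pred_patches[i], target_patches[i])
        )
    return sum(errors) / len(errors)


def combined_loss(reconstruction, contrastive, contrast_weight=0.01):
    """Weighted sum of the two objectives; the weight is a hypothetical default."""
    return reconstruction + contrast_weight * contrastive
```

With correctly paired embeddings the contrastive term is close to zero, and it grows when the pairing is shuffled, which is the signal that makes cross-modal retrieval possible.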

Paper Structure (31 sections, 6 equations, 13 figures, 16 tables)

This paper contains 31 sections, 6 equations, 13 figures, 16 tables.

Figures (13)

  • Figure 1: An illustration of our method. A) We tokenize audio spectrograms and RGB images into 16$\times$16 square patches and use them as the input to all models. B) Conventional contrastive audio-visual learning model (top) and vanilla audio-visual masked auto-encoder (bottom, also novel and first introduced in this paper). C) Our proposed contrastive audio-visual masked auto-encoder (CAV-MAE) model. CAV-MAE integrates two major self-supervised frameworks, contrastive audio-visual learning and cross-modal masked data modeling, to learn a joint and coordinated representation that performs well on both multi-modal joint classification tasks and cross-modal retrieval tasks.
  • Figure 2: Sample retrieval results.
  • Figure 3: Illustration of various masking strategies. We use uniform unstructured masking throughout the paper except in Section \ref{sec:impact_masking}.
  • Figure 4: Classification performance as a function of the number of frames used on Kinetics-Sounds (left), AudioSet-20K (middle), and VGGSound (right). Frames are uniformly sampled from each video clip. Performance improves consistently as more frames are used, but the gain saturates as the number of frames increases.
  • Figure 5: Audio spectrogram reconstruction mean squared error (MSE) as a function of masking ratio under various inference masking settings (from left to right: time masking, frequency masking, time-frequency masking, and uniform unstructured masking). We compare a CAV-MAE model trained with uniform masking (blue) and a CAV-MAE model trained with time-frequency masking (red). Both models are trained with a 75% masking ratio. Key findings are as follows: 1) Even at the same masking ratio, reconstruction difficulty differs across masking strategies. On average, time masking is the most difficult, followed by frequency masking, time-frequency masking, and uniform unstructured masking. This indicates that CAV-MAE models require local information for the reconstruction task. However, for each specific spectrogram, the order of difficulty varies (see Figures \ref{fig:mask_spec_0.75} and \ref{fig:mask_spec_0.9}). 2) The CAV-MAE model trained with time-frequency masking generally performs better than its counterpart trained with uniform masking in audio spectrogram reconstruction, particularly under the time masking and frequency masking settings, showing that it is stronger at leveraging global information. This indicates that the training masking strategy does affect the properties of the model.
  • ...and 8 more figures
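
The four masking strategies named in the captions of Figures 3 and 5 can be sketched over a grid of spectrogram patches as follows. This is a minimal sketch under stated assumptions, not the paper's code: the function names are hypothetical, the 64 by 8 patch grid in the usage note is only an example, and the way time-frequency masking splits its budget across the two axes is an assumption chosen so that the overall masked fraction roughly matches the requested ratio.

```python
import math
import random


def uniform_mask(n_time, n_freq, ratio, rng):
    """Uniform unstructured masking: pick individual patches at random."""
    cells = [(t, f) for t in range(n_time) for f in range(n_freq)]
    return set(rng.sample(cells, round(ratio * len(cells))))


def time_mask(n_time, n_freq, ratio, rng):
    """Time masking: hide every patch in randomly chosen time steps."""
    steps = rng.sample(range(n_time), round(ratio * n_time))
    return {(t, f) for t in steps for f in range(n_freq)}


def frequency_mask(n_time, n_freq, ratio, rng):
    """Frequency masking: hide every patch in randomly chosen frequency bands."""
    bands = rng.sample(range(n_freq), round(ratio * n_freq))
    return {(t, f) for t in range(n_time) for f in bands}


def time_frequency_mask(n_time, n_freq, ratio, rng):
    """Time-frequency masking: union of a time mask and a frequency mask.

    Assumption: each axis gets the same ratio r, chosen so that the union
    covers about `ratio` of the grid, i.e. 1 - (1 - r)^2 = ratio.
    """
    axis_ratio = 1.0 - math.sqrt(1.0 - ratio)
    return time_mask(n_time, n_freq, axis_ratio, rng) | frequency_mask(
        n_time, n_freq, axis_ratio, rng
    )
```

For example, `time_mask(64, 8, 0.75, random.Random(0))` returns a set of `(time, frequency)` patch indices covering three quarters of a 64 by 8 grid. The structured strategies remove whole rows or columns, so fewer visible neighbours remain near each masked patch than under uniform masking.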