
Audiovisual Masked Autoencoders

Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab

TL;DR

Audiovisual MAE extends masked autoencoding to jointly model audio and video, enabling self-supervised learning that exploits cross-modal correlations. By exploring multiple encoder fusion strategies and two learning objectives (joint reconstruction and modality inpainting), the approach learns representations that transfer across audiovisual and unimodal tasks. Empirical results on VGGSound, AudioSet, and Epic Kitchens demonstrate state-of-the-art audiovisual performance and strong transfer to challenging domains, with the mid-fusion encoder and shared decoder often delivering the best results. The work shows that larger, self-supervised audiovisual pretraining yields robust initializations and highlights the benefits of cross-modal pretraining for downstream generalization.

Abstract

Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.

Paper Structure (50 sections, 1 equation, 5 figures, 24 tables)

Figures (5)

  • Figure 1: Overview of our Audiovisual Masked Autoencoder. We jointly encode and reconstruct audiovisual inputs, to leverage the correlations between the two modalities to learn stronger representations of the data. Our pretrained encoder can then be used for audiovisual, audio-only and video-only downstream tasks.
  • Figure 2: Transformer architectures for performing audiovisual fusion. Concatenating the tokens before passing them through the transformer corresponds to "early fusion" (a), whilst two separate encoders (b) leave the fusion to the subsequent decoder ("late fusion"). An alternative way of coupling the modalities is to share weights between the two encoders (c). Finally, mid-fusion (d) represents a balance between "early" and "late" fusion.
  • Figure 3: Overview of modality inpainting for reconstructing video from audio. We first jointly encode the unmasked tokens from both audio and video. Then, we use all the encoded tokens of one modality (e.g. audio), together with mask tokens for the other (e.g. video), to reconstruct the masked modality (e.g. video). Note that we can reconstruct all combinations of modalities, and show one for clarity.
  • Figure 4: Learning curves for the "Joint Reconstruction" and "Modality Inpainting" objectives. Observe how "Joint Reconstruction" is stable across a wide range of learning rates. "Modality Inpainting", on the other hand, only performs well for a learning rate of $1.6 \times 10^{-4}$, and is unstable at higher values. These pretraining experiments were performed on VGGSound for 400 epochs with a batch size of 512.
  • Figure 5: Examples of reconstructions of our model, trained with the "Joint reconstruction" objective on AudioSet. We show video frames on the left, and audio spectrograms on the right. The first row shows the original input, the second the input after masking, and the final row shows the reconstruction produced by the model. For the unmasked patches in the reconstruction, we show the original input. Note that the model is pretrained with 16 video frames, and we show 8 here for clarity. This figure is best viewed on screen, zoomed in.
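
The token handling described in the captions of Figures 2 and 3 can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the function names, the `MASK` placeholder, and the tuple layout `(modality, position, value)` are all assumptions made for the example, and the mask ratio is left to the caller because the captions do not state one. Tokens are treated as opaque values, so no encoder or decoder is modelled here.

```python
import random

# Hypothetical placeholder standing in for a learned mask token (assumption).
MASK = None


def random_mask(num_tokens, mask_ratio, rng):
    """Randomly split token positions into (visible, masked) index lists."""
    num_masked = int(round(num_tokens * mask_ratio))
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def early_fusion_input(audio_tokens, video_tokens, audio_visible, video_visible):
    """Early fusion (Figure 2a): concatenate the visible tokens of both
    modalities into one sequence that a single encoder would process."""
    fused = [("audio", i, audio_tokens[i]) for i in audio_visible]
    fused += [("video", i, video_tokens[i]) for i in video_visible]
    return fused


def inpainting_decoder_input(encoded, source, target, target_positions):
    """Modality inpainting (Figure 3): keep every encoded token of `source`
    and add one MASK placeholder per requested position of `target`.
    Which target positions receive a mask token is chosen by the caller."""
    kept = [tok for tok in encoded if tok[0] == source]
    placeholders = [(target, i, MASK) for i in target_positions]
    return kept + placeholders


def demo(seed=0, mask_ratio=0.75):
    """Build toy inputs; the sizes and the ratio are illustrative values."""
    rng = random.Random(seed)
    audio_tokens = ["a%d" % i for i in range(8)]
    video_tokens = ["v%d" % i for i in range(16)]
    audio_visible, _ = random_mask(len(audio_tokens), mask_ratio, rng)
    video_visible, video_masked = random_mask(len(video_tokens), mask_ratio, rng)
    fused = early_fusion_input(
        audio_tokens, video_tokens, audio_visible, video_visible
    )
    # In a real model `fused` would pass through the encoder first; here the
    # unencoded tokens stand in for the encoder output.
    decoder_in = inpainting_decoder_input(fused, "audio", "video", video_masked)
    return fused, decoder_in
```

Calling `demo()` returns the fused encoder input and the decoder input for reconstructing video from audio. Swapping the `source` and `target` arguments gives the audio-from-video case, in line with the caption's note that all combinations of modalities can be reconstructed.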