Audiovisual Masked Autoencoders
Mariana-Iuliana Georgescu, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab
TL;DR
Audiovisual MAE extends masked autoencoding to model audio and video jointly, so that self-supervised pretraining can exploit the correlations between the two modalities. The work compares multiple encoder fusion strategies and two learning objectives (joint reconstruction and modality inpainting), and the resulting representations transfer to both audiovisual and unimodal tasks. Experiments report state-of-the-art audiovisual results on VGGSound and AudioSet, and on Epic Kitchens without pretraining specifically for that dataset, with the mid-fusion encoder and shared decoder often delivering the best results. The work also shows that larger, self-supervised audiovisual pretraining yields robust initializations and highlights the benefits of cross-modal pretraining for downstream generalization.
Abstract
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state of the art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
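To make the masked autoencoding setup concrete, the following is a minimal sketch of a joint reconstruction objective over two modalities, not the authors' implementation. It assumes each modality has already been split into tokens (lists of feature vectors), masks a random subset of tokens in each, and scores a model only on the tokens it did not see. The function names, the `reconstruct` callable, the mean-squared-error loss, and the 0.75 masking ratio are all illustrative assumptions rather than details taken from the abstract.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Split token indices into (visible, masked) lists for one modality."""
    num_masked = int(round(num_tokens * mask_ratio))
    order = list(range(num_tokens))
    rng.shuffle(order)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def masked_mse(pred, target, masked_idx):
    """Mean squared error computed over the masked tokens only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


def joint_reconstruction_loss(video_tokens, audio_tokens, reconstruct,
                              mask_ratio=0.75, seed=0):
    """Mask both modalities, reconstruct both, and sum the two losses.

    `reconstruct` is a hypothetical stand-in for an encoder-decoder pair.
    It receives the visible (index, token) pairs of each modality plus the
    full token counts, and returns full-length predictions for both.
    """
    rng = random.Random(seed)
    v_vis, v_mask = random_mask(len(video_tokens), mask_ratio, rng)
    a_vis, a_mask = random_mask(len(audio_tokens), mask_ratio, rng)

    visible_video = [(i, video_tokens[i]) for i in v_vis]
    visible_audio = [(i, audio_tokens[i]) for i in a_vis]

    video_pred, audio_pred = reconstruct(
        visible_video, visible_audio, len(video_tokens), len(audio_tokens)
    )
    return (masked_mse(video_pred, video_tokens, v_mask)
            + masked_mse(audio_pred, audio_tokens, a_mask))
```

Because `reconstruct` sees the visible tokens of both modalities at once, it can in principle use audio context to fill in masked video tokens and vice versa, which is the cross-modal signal the abstract refers to. How the two token streams are fused inside the encoder is left open here.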
