EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning
Jongsuk Kim, Hyeongkeun Lee, Kyeongha Rho, Junmo Kim, Joon Son Chung
TL;DR
EquiAV tackles the challenge that data augmentations can distort audio-visual correspondence in self-supervised learning. It introduces an equivariant framework that learns augmentation-aware intra-modal representations and transfers this information to the inter-modal space via a shared transformation predictor, using the centroid of multiple equivariant embeddings for robust inter-modal supervision. An attention-based predictor encodes augmentation vectors and enables efficient, low-overhead computation since equivariant embeddings are generated from original inputs. Empirical results on AudioSet-2M/20K and VGGSound show state-of-the-art performance in audio-visual event classification and zero-shot retrieval, with comprehensive ablations confirming the benefits of intra-modal equivariance, centroid-based inter-modal learning, and the proposed loss and architecture choices. The approach holds promise for extending equivariance-based strategies to other multi-modal domains, including vision-language tasks.
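As an informal sketch of the two ideas named above, and using notation assumed here rather than taken from the paper: let $f$ be a modality encoder, $t_k$ an augmentation with a known parameter vector, and $u$ the transformation predictor. Equivariance asks the predictor to reproduce, from the original input's embedding, the embedding of the augmented input; the centroid then averages $K$ such predicted embeddings.

```latex
% Assumed notation: f = encoder, t_k = k-th augmentation, u = transformation predictor
\[
  u\bigl(f(x),\, t_k\bigr) \;\approx\; f\bigl(t_k(x)\bigr),
  \qquad
  \bar{z} \;=\; \frac{1}{K} \sum_{k=1}^{K} u\bigl(f(x),\, t_k\bigr)
\]
```

Because the right-hand average depends only on $f(x)$ and the augmentation vectors, it can be formed without re-encoding augmented inputs, which is consistent with the low-overhead claim above.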
Abstract
Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the advantages of data augmentation verified in many learning methods, audio-visual learning has struggled to fully harness these benefits, as augmentations can easily disrupt the correspondence between input pairs. To address this limitation, we introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning. Our approach begins by extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor. This predictor enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision. Notably, this is achieved with minimal computational overhead. Extensive ablation studies and qualitative results verify the effectiveness of our method. EquiAV outperforms previous works across various audio-visual benchmarks. The code is available at https://github.com/JongSuk1/EquiAV.
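The aggregation step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the predictor below is a single linear map swapped in for the attention-based predictor, and every name here (`predict_equivariant`, `centroid`, `info_nce`, the shapes, the temperature) is a hypothetical choice made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def predict_equivariant(z, t, W):
    """Hypothetical linear stand-in for the transformation predictor.

    z: (B, D) embeddings of the original inputs
    t: (K, P) augmentation parameter vectors
    W: (D + P, D) weights (assumed; the paper uses an attention-based module)
    Returns (B, K, D): one predicted embedding per input and augmentation.
    """
    B, K = z.shape[0], t.shape[0]
    z_rep = np.repeat(z[:, None, :], K, axis=1)   # (B, K, D)
    t_rep = np.repeat(t[None, :, :], B, axis=0)   # (B, K, P)
    return np.concatenate([z_rep, t_rep], axis=-1) @ W


def centroid(z_hat):
    """Average the K predicted embeddings into one unit-length vector per input."""
    return l2_normalize(z_hat.mean(axis=1))


def info_nce(a, v, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` is paired with row i of `v`."""
    logits = l2_normalize(a) @ l2_normalize(v).T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


# Toy usage with random data (all sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
B, D, K, P = 4, 8, 3, 5
z_audio = rng.normal(size=(B, D))
z_video = rng.normal(size=(B, D))
t_audio = rng.normal(size=(K, P))
t_video = rng.normal(size=(K, P))
W_shared = rng.normal(size=(D + P, D)) * 0.1  # one predictor shared by both modalities

c_audio = centroid(predict_equivariant(z_audio, t_audio, W_shared))
c_video = centroid(predict_equivariant(z_video, t_video, W_shared))
inter_modal_loss = info_nce(c_audio, c_video)
```

In this sketch the centroids are built from the embeddings of the original inputs only, so no augmented input is passed through an encoder for the inter-modal term; that is the property the abstract describes as minimal computational overhead.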
