EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning

Jongsuk Kim, Hyeongkeun Lee, Kyeongha Rho, Junmo Kim, Joon Son Chung

TL;DR

EquiAV tackles the challenge that data augmentations can distort audio-visual correspondence in self-supervised learning. It introduces an equivariant framework that learns augmentation-aware intra-modal representations and transfers this information to the inter-modal space via a shared transformation predictor, using the centroid of multiple equivariant embeddings for robust inter-modal supervision. An attention-based predictor encodes augmentation vectors and enables efficient, low-overhead computation since equivariant embeddings are generated from original inputs. Empirical results on AudioSet-2M/20K and VGGSound show state-of-the-art performance in audio-visual event classification and zero-shot retrieval, with comprehensive ablations confirming the benefits of intra-modal equivariance, centroid-based inter-modal learning, and the proposed loss and architecture choices. The approach holds promise for extending equivariance-based strategies to other multi-modal domains, including vision-language tasks.

Abstract

Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the advantages of data augmentation verified in many learning methods, audio-visual learning has struggled to fully harness these benefits, as augmentations can easily disrupt the correspondence between input pairs. To address this limitation, we introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning. Our approach begins by extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor. It enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision. Notably, this is achieved with minimal computational overhead. Extensive ablation studies and qualitative results verify the effectiveness of our method. EquiAV outperforms previous works across various audio-visual benchmarks. The code is available at https://github.com/JongSuk1/EquiAV.

Paper Structure

This paper contains 35 sections, 21 equations, 3 figures, 13 tables, 1 algorithm.

Figures (3)

  • Figure 1: Conceptual illustration of EquiAV. Within the intra-modal latent space, the model learns augmentation-related information by leveraging equivariance. Extending equivariance to the inter-modal space provides robust cross-modal supervision.
  • Figure 2: Overview of the proposed EquiAV framework. Given an audio-visual input pair and its augmented version, the audio encoder and the visual encoder encode them into representations. The transformation predictor takes the original representation $h_m$ and the parameterized augmentation vector $t_m$ as inputs and outputs the equivariant representation $\hat{h}_m$. In the intra-modal latent space, the model is trained to maximize the similarity between the equivariant embedding and the augmented embedding. In the inter-modal latent space, we sample multiple augmentation vectors $\{t_{m_i}\}_{i\in\{1,\dots,S\}}$ and generate the corresponding equivariant representations $\{\hat{h}_{m_i}\}_{i\in\{1,\dots,S\}}$. Then, the centroid $\bar{h}_m$ is computed and used for inter-modal contrastive learning.
  • Figure 3: Qualitative results of EquiAV using VGGSound. Top Left: Original input image. Top Right: Zero-shot sound source localization result for the original input image. Bottom Left: Augmented image whose embedding is the closest to the centroid of equivariant embeddings. Bottom Right: Zero-shot sound source localization result for the augmented image.
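
The inter-modal path described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the predictor here is a plain linear map over the concatenation of $h_m$ and $t_m$ (the paper uses an attention-based predictor), augmentation vectors are drawn uniformly at random, and the contrastive objective is assumed to be a symmetric InfoNCE loss. All function names, dimensions, and the temperature value are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # L2-normalize; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def predict_equivariant(h, t, weight):
    # Hypothetical stand-in for the transformation predictor: a linear map
    # over [h; t]. The paper's predictor is attention-based.
    x = h + t  # list concatenation of representation and augmentation vector
    return [dot(row, x) for row in weight]

def centroid(vectors):
    # Element-wise mean of the equivariant representations.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def equivariant_centroid(h, weight, aug_dim, num_samples, rng):
    # Sample S augmentation vectors (assumed uniform in [-1, 1]), predict one
    # equivariant representation per sample from the ORIGINAL representation h,
    # and average them.
    samples = []
    for _ in range(num_samples):
        t = [rng.uniform(-1.0, 1.0) for _ in range(aug_dim)]
        samples.append(predict_equivariant(h, t, weight))
    return centroid(samples)

def info_nce(anchors, targets, temperature=0.1):
    # Assumed contrastive objective: anchors[i] should match targets[i].
    anchors = [normalize(a) for a in anchors]
    targets = [normalize(b) for b in targets]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, b) / temperature for b in targets]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(anchors)

# Usage with random stand-ins for the encoder outputs (hypothetical sizes).
rng = random.Random(0)
dim, aug_dim, batch, S = 4, 3, 2, 8
w_audio = [[rng.gauss(0, 1) for _ in range(dim + aug_dim)] for _ in range(dim)]
w_visual = [[rng.gauss(0, 1) for _ in range(dim + aug_dim)] for _ in range(dim)]
h_audio = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]
h_visual = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]

audio_centroids = [equivariant_centroid(h, w_audio, aug_dim, S, rng) for h in h_audio]
visual_centroids = [equivariant_centroid(h, w_visual, aug_dim, S, rng) for h in h_visual]
loss = 0.5 * (info_nce(audio_centroids, visual_centroids)
              + info_nce(visual_centroids, audio_centroids))
print(f"inter-modal loss: {loss:.4f}")
```

Because every equivariant representation is predicted from the original representation, each of the S samples costs one predictor call rather than an extra encoder pass, which is consistent with the low overhead the abstract claims.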