Contrastive Learning Via Equivariant Representation

Sifan Song, Jinfeng Wang, Qiaochu Zhao, Xiang Li, Dufan Wu, Angelos Stefanidis, Jionglong Su, S. Kevin Zhou, Quanzheng Li

TL;DR

Experimental results demonstrate that CLeVER effectively extracts and incorporates equivariant information from practical natural images, thereby improving the training efficiency and robustness of baseline models in downstream tasks and achieving state-of-the-art (SOTA) performance.

Abstract

Invariant Contrastive Learning (ICL) methods have achieved impressive performance across various domains. However, because ICL lacks a latent-space representation of distortion (augmentation)-related information, it is sub-optimal in training efficiency and in robustness on downstream tasks. Recent studies suggest that introducing equivariance into Contrastive Learning (CL) can improve overall performance. In this paper, we revisit the roles of augmentation strategies and equivariance in improving CL's efficacy. We propose CLeVER (Contrastive Learning Via Equivariant Representation), a novel equivariant contrastive learning framework that is compatible with augmentation strategies of arbitrary complexity and with various mainstream CL backbone models. Experimental results demonstrate that CLeVER effectively extracts and incorporates equivariant information from practical natural images, thereby improving the training efficiency and robustness of baseline models in downstream tasks and achieving state-of-the-art (SOTA) performance. Moreover, we find that leveraging the equivariant information extracted by CLeVER simultaneously enhances rotational invariance and sensitivity across experimental tasks, and helps stabilize the framework when handling complex augmentations, particularly for models with small-scale backbones.

Paper Structure

This paper contains 19 sections, 13 equations, 8 figures, and 10 tables.

Figures (8)

  • Figure 1: CLeVER can introduce a comprehensive robustness improvement for DINO.
  • Figure 2: A brief overview of CLeVER. (a) $f(\cdot)$ and $g(\cdot)$ are backbone models. In DINO, they form an EMA-based (Exponential Moving Average) teacher-student pair. Each $z_{EF}$ denotes an Equivariant Factor in the latent space corresponding to the transformation operations $t_1$ and $t_2$, and $z_{IR}$ denotes the invariant representation of the invariant semantics in the latent space. $h$ represents the projection head used in the pretext task. CLeVER uses the contrastive learning loss ($L_{CL}$) of the baseline method, the orthogonality loss ($L_{Orth}$), and the projection regularization loss ($L_{PReg}$). (b) In downstream tasks, the disentangled invariant representation and equivariant factor from the pre-trained backbone are incorporated for inference and prediction.
  • Figure 3: Visualization of self-attention under various augmentation settings.
  • Figure 4: Qualitative performance of unsupervised saliency segmentation task.
  • Figure 5: Ablation studies on hyperparameters.
  • ...and 3 more figures
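
The Figure 2 caption names two mechanisms that a short sketch may help make concrete: an EMA-based teacher-student pair, and an orthogonality loss involving the invariant representation $z_{IR}$ and the equivariant factor $z_{EF}$. The Python below is only an illustration under stated assumptions. The EMA rule is the standard exponential moving average. The equal-halves split of the latent vector and the squared-cosine form of the penalty are hypothetical choices made for this example, since this page does not give the definition of $L_{Orth}$. All function names are invented here.

```python
import math


def ema_update(teacher, student, momentum=0.99):
    """Standard exponential moving average of parameters:
    teacher <- momentum * teacher + (1 - momentum) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def split_latent(z):
    """Split a latent vector into (z_IR, z_EF).

    ASSUMPTION: an equal-halves split, chosen only for illustration.
    """
    half = len(z) // 2
    return z[:half], z[half:]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def orthogonality_penalty(z_ir, z_ef):
    """Squared cosine similarity: 0 when the vectors are orthogonal,
    1 when they are parallel.

    ASSUMPTION: a hypothetical stand-in for L_Orth, not the paper's formula.
    """
    return cosine_similarity(z_ir, z_ef) ** 2


# Example: an orthogonal pair incurs no penalty, a parallel pair the maximum.
z_ir, z_ef = split_latent([1.0, 0.0, 0.0, 1.0])
penalty_orthogonal = orthogonality_penalty(z_ir, z_ef)
penalty_parallel = orthogonality_penalty(*split_latent([1.0, 0.0, 2.0, 0.0]))
```

Under these assumptions, minimizing the penalty pushes the two parts of the latent code toward carrying non-overlapping directions, which is one plausible reading of "disentangled" in the caption. The actual loss terms and their weights should be taken from the paper's equations.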