Soft Equivariance Regularization for Invariant Self-Supervised Learning

Joohyung Lee, Changhun Kim, Hyunsu Kim, Kwanhyung Lee, Juho Lee

TL;DR

Soft Equivariance Regularization (SER) is a plug-in regularizer that decouples where invariance and equivariance are enforced: it keeps the base SSL objective unchanged on the final embedding while softly encouraging equivariance on an intermediate spatial token map via analytically specified group actions applied directly in feature space.

Abstract

Self-supervised learning (SSL) typically learns representations invariant to semantic-preserving augmentations. While effective for recognition, enforcing strong invariance can suppress transformation-dependent structure that is useful for robustness to geometric perturbations and spatially sensitive transfer. A growing body of work therefore augments invariance-based SSL with equivariance objectives, but these objectives are often imposed on the same final representation. We empirically observe a trade-off in this coupled setting: pushing equivariance regularization toward deeper layers improves equivariance scores but degrades ImageNet-1k linear evaluation. Motivated by this trade-off, we propose Soft Equivariance Regularization (SER), a plug-in regularizer that decouples where invariance and equivariance are enforced: we keep the base SSL objective unchanged on the final embedding, while softly encouraging equivariance on an intermediate spatial token map via analytically specified group actions $\rho_g$ applied directly in feature space. SER neither learns nor predicts per-sample transformation codes or labels, requires no auxiliary transformation-prediction head, and adds only 1.008x training FLOPs. On ImageNet-1k ViT-S/16 pretraining, SER improves MoCo-v3 by +0.84 Top-1 in linear evaluation under a strictly matched 2-view setting and consistently improves DINO and Barlow Twins; under matched view counts, SER achieves the best ImageNet-1k linear-eval Top-1 among the compared invariance+equivariance add-ons. SER further improves ImageNet-C/P by +1.11/+1.22 Top-1 and frozen-backbone COCO detection by +1.7 mAP. Finally, applying the same layer-decoupling recipe to existing invariance+equivariance baselines improves their accuracy, suggesting layer decoupling as a general design principle for combining invariance and equivariance.

Paper Structure (42 sections, 10 equations, 5 figures, 17 tables)

Figures (5)

  • Figure 1: Overview of SER. For each image in $b_2$, we sample two views from the equivariant-view policy $\mathcal{T}_{\mathrm{eq}}$. We decompose each sampled transform into a geometric component $g\in\mathcal{G}$ and a photometric component (e.g., color jitter), and denote by $g_1,g_2\in\mathcal{G}$ the geometric parts of the two views. We use the relative transform $g=g_2g_1^{-1}$ to align intermediate token maps in feature space. For clarity, the independently sampled photometric components in $\mathcal{T}_{\mathrm{eq}}$ are omitted in the diagram.
  • Figure 2: An overview of the training pipeline. The mini-batch is split into $b_1$ and $b_2$: $b_1$ uses the baseline SSL augmentation policy $\mathcal{T}$ (including cropping), while $b_2$ uses an equivariant-view policy $\mathcal{T}_{\mathrm{eq}}$ that disables cropping and adds discrete $90^\circ$ rotations while retaining the baseline photometric jitter. Both $b_1$ and $b_2$ contribute to the baseline invariance loss. SER additionally applies an equivariance regularizer to an intermediate spatial token map using only the invertible geometric component of the $b_2$ transforms to define the feature-space action $\rho_g$.
  • Figure 3: Ablation study on the layer at which equivariance is regularized (left) and the layer at which the [CLS] token is inserted in the ViT encoder, with the equivariance regularization layer fixed at the 3rd layer (right). Both Top-1 (left) and Top-5 (right) accuracies peak when the equivariance loss and the [CLS] token are introduced near the middle of the network.
  • Figure 4: t-SNE visualization of latent space features from 20 randomly sampled ImageNet-1k classes, comparing (a) MoCo-v3 (trained with invariance loss alone) and (b) MoCo-v3 + Ours. Our method promotes better class clustering, demonstrating that incorporating equivariance benefits downstream tasks requiring invariance.
  • Figure 5: t-SNE visualization of latent space features from 20 randomly sampled ImageNet-C classes under defocus blur corruption, comparing (a) MoCo-v3 (trained with invariance loss alone) and (b) MoCo-v3 + Ours. Our method maintains better class clustering under corruption, demonstrating robustness benefits of incorporating equivariance.
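
The feature-space alignment described in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the group is restricted to the four $90^\circ$ rotations (each encoded as a number of quarter-turns), the token map is a square grid of per-token feature vectors, the action $\rho_g$ only permutes token positions and leaves channels untouched, and the mismatch is measured by mean squared error. The names `rot90`, `relative_rotation`, and `alignment_error` are hypothetical helpers, not the authors' implementation.

```python
def rot90(grid, k):
    """Rotate a square token grid counter-clockwise by k quarter-turns.

    grid[h][w] is one token (a list of floats). Only positions move;
    token contents are untouched (an assumption of this sketch).
    """
    out = [list(row) for row in grid]
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise turn.
        out = [list(row) for row in zip(*out)][::-1]
    return out


def relative_rotation(k1, k2):
    """Quarter-turns of g = g2 * g1^{-1} for rotations g1 = k1 and g2 = k2."""
    return (k2 - k1) % 4


def alignment_error(z1, z2, k1, k2):
    """Mean squared error between rho_g(z1) and z2, with g = g2 * g1^{-1}.

    MSE is an illustrative choice, not necessarily the paper's loss.
    """
    aligned = rot90(z1, relative_rotation(k1, k2))
    total, count = 0.0, 0
    for row_a, row_b in zip(aligned, z2):
        for tok_a, tok_b in zip(row_a, row_b):
            for a, b in zip(tok_a, tok_b):
                total += (a - b) ** 2
                count += 1
    return total / count


if __name__ == "__main__":
    # A toy 2x2 "token map" with one channel per token. Using the input
    # itself as the feature map mimics a perfectly equivariant encoder.
    image = [[[1.0], [2.0]], [[3.0], [4.0]]]
    z1 = rot90(image, 1)  # view 1: g1 = one quarter-turn
    z2 = rot90(image, 3)  # view 2: g2 = three quarter-turns
    print(alignment_error(z1, z2, 1, 3))
```

In this toy setting the two views align exactly once the relative rotation is applied, so the error vanishes. With a real encoder the intermediate token maps would only be approximately equivariant, and a penalty of this kind would act as a soft regularizer alongside the unchanged invariance loss.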