Self-Supervised Learning based on Transformed Image Reconstruction for Equivariance-Coherent Feature Representation

Qin Wang, Benjamin Bruns, Hanno Scharr, Kai Krajsek

TL;DR

The paper tackles the tendency of self-supervised learning (SSL) methods to emphasize invariance at the expense of useful equivariant representations. It proposes a transformation-based SSL framework that reconstructs intermediate transformed images, splitting encoder features into invariant and equivariant parts and learning the latter via two decoders with a reconstruction loss $L_{recon}$ alongside the standard SSL loss $L_{SSL}$ in a combined objective $L_{total} = L_{SSL} + \lambda L_{recon}$. The method achieves state-of-the-art performance on synthetic equivariance tasks and delivers strong downstream results on natural image tasks, especially when integrated with augmentation-based SSL baselines like iBOT and DINOv2, while also offering robust transfer to dense prediction tasks. Overall, it demonstrates that incorporating intermediate transformation reconstruction yields more complete feature representations by balancing invariance and equivariance, with practical benefits across a range of vision tasks and SSL frameworks.

Abstract

The equivariant behaviour of features is essential in many computer vision tasks, yet popular self-supervised learning (SSL) methods tend to constrain equivariance by design. We propose a self-supervised learning approach where the system learns transformations independently by reconstructing images that have undergone previously unseen transformations. Specifically, the model is tasked to reconstruct intermediate transformed images, e.g. translated or rotated images, without prior knowledge of these transformations. This auxiliary task encourages the model to develop equivariance-coherent features without relying on predefined transformation rules. To this end, we apply transformations to the input image, generating an image pair, and then split the extracted features into two sets per image. One set is used with a usual SSL loss encouraging invariance, the other with our loss based on the auxiliary task to reconstruct the intermediate transformed images. Our loss and the SSL loss are linearly combined with weighted terms. Evaluating on synthetic tasks with natural images, our proposed method strongly outperforms all competitors, regardless of whether they are designed to learn equivariance. Furthermore, when trained alongside augmentation-based methods as the invariance tasks, such as iBOT or DINOv2, we successfully learn a balanced combination of invariant and equivariant features. Our approach performs strongly on a rich set of realistic computer vision downstream tasks, almost always improving over all baselines.
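The feature split and the weighted combination of the two losses can be illustrated with a minimal sketch. It follows the combined objective $L_{total} = L_{SSL} + \lambda L_{recon}$ from the summary; everything else is an assumption made for illustration: the split is shown as a plain slice of the feature vector, mean squared error stands in for the reconstruction loss, and the names `split_features`, `mse`, and `total_loss` are hypothetical. The numbers are made up and are not results from the paper.

```python
from typing import List, Sequence, Tuple


def split_features(
    features: Sequence[float], n_invariant: int
) -> Tuple[List[float], List[float]]:
    """Split a feature vector into (invariant part, equivariant part).

    Assumption: the split is a plain slice at index n_invariant.
    """
    if not 0 <= n_invariant <= len(features):
        raise ValueError("n_invariant must be between 0 and len(features)")
    return list(features[:n_invariant]), list(features[n_invariant:])


def mse(prediction: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error, used here as a stand-in reconstruction loss."""
    if len(prediction) != len(target) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def total_loss(l_ssl: float, l_recon: float, lam: float) -> float:
    """Combined objective: L_total = L_SSL + lambda * L_recon."""
    return l_ssl + lam * l_recon


# Toy usage with made-up numbers.
features_v1 = [0.2, -0.1, 0.7, 0.4]
invariant_part, equivariant_part = split_features(features_v1, n_invariant=2)

decoded = [0.5, 0.5, 0.0]  # hypothetical decoder output (flattened image)
target = [0.5, 0.0, 0.0]   # hypothetical intermediate transformed image
l_recon = mse(decoded, target)
l_ssl = 1.0                # placeholder for an iBOT/DINOv2-style loss value
print(total_loss(l_ssl, l_recon, lam=0.5))
```

In this sketch only the equivariant part would feed the decoders, while the invariant part would go to the SSL loss; the weight `lam` controls how strongly the reconstruction task influences training.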

Paper Structure

This paper contains 31 sections, 2 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Overview of our proposed framework. The transformation $g$ is illustrated by a rotation with angle $\theta$. The common augmentation pipeline follows DINOv2, but for $v_2$ a transformation is applied in addition to the common augmentations.
  • Figure 2: Synthetic tasks for evaluating equivariant representations. The transformation ($g$) is applied to the original image $I$, and both the transformed and original images are processed through a pretrained encoder $f$. A lightweight MLP $h$ then predicts the parameters of the applied transformation.
  • Figure 3: Absolute accuracy difference compared to iBOT across downstream tasks. Pretrained with SE(2) reconstruction.
  • Figure 4: Absolute accuracy difference compared to DINOv2 across downstream tasks. Pretrained with SE(2) reconstruction.
  • Figure 5: Results on synthetic tasks across SSL methods.
  • ...and 4 more figures
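
The synthetic evaluation described in the Figure 2 caption (encode the original and the transformed image with $f$, then let a lightweight MLP $h$ predict the transformation parameters) can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's setup: `toy_encoder` is a hypothetical stand-in for the pretrained encoder, a cyclic shift of a 1-D signal stands in for the transformation $g$, the two feature vectors are assumed to be concatenated before the MLP, and the MLP is randomly initialised with training omitted.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def toy_encoder(image: Sequence[float]) -> Vector:
    """Hypothetical stand-in for the pretrained encoder f (not a real model)."""
    n = len(image)
    mean = sum(image) / n
    # Position-weighted mean: changes when the content is shifted.
    centroid = sum(i * v for i, v in enumerate(image)) / n
    return [mean, centroid]


def cyclic_shift(image: Sequence[float], shift: int) -> Vector:
    """Toy transformation g: cyclic translation of a 1-D signal to the right."""
    n = len(image)
    k = shift % n
    return list(image[n - k:]) + list(image[:n - k])


def make_mlp(
    in_dim: int, hidden_dim: int, out_dim: int, seed: int = 0
) -> Callable[[Sequence[float]], Vector]:
    """Build a small two-layer MLP h with fixed random weights (untrained)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)] for _ in range(out_dim)]

    def h(x: Sequence[float]) -> Vector:
        if len(x) != in_dim:
            raise ValueError("input has the wrong dimension")
        # Linear layer followed by ReLU, then a linear output layer.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
        return [sum(w * v for w, v in zip(row, hidden)) for row in w2]

    return h


def predict_transform_params(
    f: Callable[[Sequence[float]], Vector],
    h: Callable[[Sequence[float]], Vector],
    image: Sequence[float],
    transformed: Sequence[float],
) -> Vector:
    """Encode both images with f and predict the parameters of g with h."""
    # Assumption: features of both images are concatenated before the MLP.
    return h(f(image) + f(transformed))


image = [0.0, 1.0, 2.0, 0.0, 0.0, 0.0]
shifted = cyclic_shift(image, 2)
h = make_mlp(in_dim=4, hidden_dim=8, out_dim=1)
prediction = predict_transform_params(toy_encoder, h, image, shifted)
print(prediction)  # one predicted parameter (the shift), from untrained weights
```

The toy encoder makes the underlying point concrete: the mean is identical for `image` and `shifted` (an invariant feature), whereas the centroid differs (a feature that changes with the transformation), so only the latter carries information the MLP could use to recover the shift.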