Equivariant Representation Learning for Augmentation-based Self-Supervised Learning via Image Reconstruction
Qin Wang, Kai Krajsek, Hanno Scharr
TL;DR
The paper addresses the limitation of augmentation-based self-supervised learning methods that focus on invariant features by introducing a reconstruction-based auxiliary task to promote equivariant representations. It extends the SIE framework with a split of encoder features into invariant and equivariant parts and a cross-attention reconstruction decoder that fuses features from two augmented views, optimizing a joint loss that includes a reconstruction term. Empirically, the method achieves parity with SIE on the synthetic 3DIEBench dataset and consistently improves performance on ImageNet and related transfer tasks, particularly when multiple augmentations are used. This approach offers a practical path to more robust and generalizable visual representations without requiring prior knowledge of transformation parameters at training time.
Abstract
Augmentation-based self-supervised learning methods have shown remarkable success in visual representation learning, excelling at learning invariant features but often neglecting equivariant ones. This limitation reduces the generalizability of foundation models, particularly for downstream tasks requiring equivariance. We propose integrating an image reconstruction task as an auxiliary component in augmentation-based self-supervised learning algorithms to facilitate equivariant feature learning without additional parameters. Our method uses a cross-attention mechanism to blend features learned from two augmented views and then reconstructs one of them. This approach is adaptable to various datasets and augmented-pair-based learning methods. We evaluate its effectiveness at learning equivariant features through multiple linear regression tasks and downstream applications on both artificial (3DIEBench) and natural (ImageNet) datasets. Results consistently demonstrate significant improvements over standard augmentation-based self-supervised learning methods and state-of-the-art approaches, particularly in scenarios involving combined augmentations. Our method enhances the learning of both invariant and equivariant features, leading to more robust and generalizable visual representations for computer vision tasks.
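The mechanism summarized above (a feature split, cross-attention between two views, and a joint loss with a reconstruction term) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head attention, the choice of view 1 tokens as queries and view 2 tokens as keys and values, the mean-squared-error stand-in for the invariance objective, and all names (`split_features`, `cross_attention`, `joint_loss`, `rec_weight`, `w_out`) are assumptions made for illustration, and the encoder outputs are replaced by random arrays.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def split_features(z, d_inv):
    """Split features along the channel axis into (invariant, equivariant) parts."""
    return z[..., :d_inv], z[..., d_inv:]


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: tokens of one view attend to the other view."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v, attn


def joint_loss(z1, z2, target_patches, params, d_inv, rec_weight=1.0):
    """Hypothetical joint objective: invariance term plus weighted reconstruction term."""
    inv1, _ = split_features(z1, d_inv)
    inv2, _ = split_features(z2, d_inv)
    # Assumption: a plain MSE between mean-pooled invariant parts stands in
    # for whatever augmentation-based SSL loss the base method uses.
    inv_loss = np.mean((inv1.mean(axis=0) - inv2.mean(axis=0)) ** 2)

    # Assumption: view 1 tokens query view 2 tokens; the fused tokens are
    # linearly projected to pixel patches to reconstruct view 1.
    fused, _ = cross_attention(z1, z2, params["w_q"], params["w_k"], params["w_v"])
    recon = fused @ params["w_out"]
    rec_loss = np.mean((recon - target_patches) ** 2)

    return inv_loss + rec_weight * rec_loss, inv_loss, rec_loss


# Toy usage with random arrays in place of real encoder outputs and image patches.
rng = np.random.default_rng(0)
n_tokens, d, d_inv, d_attn, patch_dim = 16, 32, 16, 24, 48
z1 = rng.normal(size=(n_tokens, d))          # token features of view 1
z2 = rng.normal(size=(n_tokens, d))          # token features of view 2
target = rng.normal(size=(n_tokens, patch_dim))  # flattened patches of view 1
params = {
    "w_q": rng.normal(size=(d, d_attn)) * 0.1,
    "w_k": rng.normal(size=(d, d_attn)) * 0.1,
    "w_v": rng.normal(size=(d, d_attn)) * 0.1,
    "w_out": rng.normal(size=(d_attn, patch_dim)) * 0.1,
}
total, inv_loss, rec_loss = joint_loss(z1, z2, target, params, d_inv, rec_weight=0.5)
print(f"total={total:.4f} invariance={inv_loss:.4f} reconstruction={rec_loss:.4f}")
```

In this sketch the reconstruction term is the only place where information about how the two views differ has to be used, which is the intuition behind using reconstruction to encourage equivariant features; how the actual method routes the invariant and equivariant parts into its decoder is not specified here and may differ.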
