Self-supervised Transformation Learning for Equivariant Representations

Jaemyung Yu, Jaehyun Choi, Dong-Jae Lee, HyeongGwon Hong, Junmo Kim

TL;DR

Self-supervised Transformation Learning (STL) replaces explicit transformation labels with learned transformation representations to enable both invariant and equivariant learning without extra batch complexity. By introducing a transformation representation encoder and a self-supervised alignment objective, STL captures interdependencies among transformations and learns corresponding equivariant mappings in representation space. Across diverse classification and detection tasks, STL achieves state-of-the-art or competitive results, particularly excelling when integrated with AugMix. This approach offers a flexible, broadly applicable framework for robust, transformation-aware representation learning with compatibility across multiple base models.

Abstract

Unsupervised representation learning has significantly advanced various machine learning tasks. In the computer vision domain, state-of-the-art approaches utilize transformations like random crop and color jitter to achieve invariant representations, embedding semantically the same inputs despite transformations. However, this can degrade performance in tasks requiring precise features, such as localization or flower classification. To address this, recent research incorporates equivariant representation learning, which captures transformation-sensitive information. However, current methods depend on transformation labels and thus struggle with interdependency and complex transformations. We propose Self-supervised Transformation Learning (STL), replacing transformation labels with transformation representations derived from image pairs. The proposed method ensures transformation representation is image-invariant and learns corresponding equivariant transformations, enhancing performance without increased batch complexity. We demonstrate the approach's effectiveness across diverse classification and detection tasks, outperforming existing methods in 7 out of 11 benchmarks and excelling in detection. By integrating complex transformations like AugMix, unusable by prior equivariant methods, this approach enhances performance across tasks, underscoring its adaptability and resilience. Additionally, its compatibility with various base models highlights its flexibility and broad applicability. The code is available at https://github.com/jaemyung-u/stl.

Paper Structure (24 sections, 18 equations, 5 figures, 11 tables)

This paper contains 24 sections, 18 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Visualization of Equivariant Transformation and Transformation Representation. (Left) UMAP visualizations of functional weights from equivariant transformations implemented with a hypernetwork. EquiMod uses transformation labels to generate these weights, while STL derives them from the representation pairs of transformed and original images. (Right) UMAP visualizations of transformation representations obtained from representation pairs of the original input image and the transformed input image.
  • Figure 2: Transformation Equivariant Learning with Self-supervised Transformation Learning. (Left) The overall framework of STL. For a given image and transformations, it demonstrates: 1) transformation invariant learning, which aligns the representations of the image and the transformed image; 2) transformation equivariant learning, where the representation of the image, transformed by an equivariant transformation (obtained from the transformation representation of a different image with the same applied transformation), aligns with the transformed image's representation; 3) self-supervised transformation learning, which aligns the transformation representations obtained from different image pairs. (Right) The transformations of each representation and the equivariant transformations within the representation space.
  • Figure 3: Aligned Transformed Batch. (Left) In self-supervised learning methods, batch compositions typically involve applying two different transformations to each input image. (Right) In STL, batches are composed by pairing two images and applying the same transformation pair to both.
  • Figure 4: Visualization of Transformation Representations by Intensity. UMAP visualization of transformation representations organized by intensity levels for each transformation type, including random crop and color jitter variations in brightness, contrast, saturation, and hue. Parameter ranges for each transformation are divided into four segments to apply varying intensities, with darker colors representing higher intensities. Representations are captured by a ResNet-18 model pretrained on STL10 with a transformation backbone.
  • Figure 5: Explicit and Implicit Equivariant Learning. Transformation equivariant learning with transformation labels is divided into (Left) explicit and (Right) implicit equivariant learning.
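
The three alignment terms described in the Figure 2 caption, computed on an aligned batch in the spirit of Figure 3, can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the InfoNCE-style loss, the temperature, and the linear maps `w_trans` and `w_equi` are all hypothetical stand-ins (the Figure 1 caption indicates the equivariant transformation is implemented with a hypernetwork, which is replaced here by a plain linear map for brevity). Row i of every array is assumed to share the same transformation t_i across the paired images a_i and b_i.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def info_nce(queries, keys, temperature=0.2):
    # Assumed contrastive loss: row i of `queries` is positive with row i
    # of `keys`; all other rows of `keys` act as negatives.
    q, k = l2_normalize(queries), l2_normalize(keys)
    logits = q @ k.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def stl_style_losses(y_a, y_a_t, y_b, y_b_t, w_trans, w_equi):
    # y_a, y_b: representations of the paired original images (N x d).
    # y_a_t, y_b_t: representations after the SAME transformation per row.
    # Transformation representation from an (original, transformed) pair;
    # a hypothetical linear map over the concatenated pair.
    z_a = np.concatenate([y_a, y_a_t], axis=1) @ w_trans
    z_b = np.concatenate([y_b, y_b_t], axis=1) @ w_trans
    # 1) Invariant learning: image and transformed image align.
    loss_inv = info_nce(y_a, y_a_t)
    # 2) Equivariant learning: transform y_a using the transformation
    #    representation taken from the OTHER image of the pair.
    y_a_pred = np.concatenate([y_a, z_b], axis=1) @ w_equi
    loss_equi = info_nce(y_a_pred, y_a_t)
    # 3) Transformation learning: the same transformation seen through
    #    different images should yield aligned representations.
    loss_trans = info_nce(z_a, z_b)
    return loss_inv, loss_equi, loss_trans

# Usage with random stand-in representations (no real encoder involved).
rng = np.random.default_rng(0)
n, d, k = 8, 16, 8
y_a, y_a_t, y_b, y_b_t = (rng.normal(size=(n, d)) for _ in range(4))
w_trans = rng.normal(size=(2 * d, k))
w_equi = rng.normal(size=(d + k, d))
losses = stl_style_losses(y_a, y_a_t, y_b, y_b_t, w_trans, w_equi)
```

Because the transformation representation used in term 2 comes from a different image than the one being transformed, it can only help if it encodes the transformation rather than the image content, which is the image-invariance property the abstract describes. Pairing images inside the batch supplies that second image without enlarging the batch.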