Gen-SIS: Generative Self-augmentation Improves Self-supervised Learning

Varun Belagali, Srikar Yellapragada, Alexandros Graikos, Saarthak Kapse, Zilinghan Li, Tarak Nath Nandi, Ravi K Madduri, Prateek Prasanna, Joel Saltz, Dimitris Samaras

TL;DR

Gen-SIS introduces a self-contained diffusion-based data augmentation framework for self-supervised learning that relies solely on unlabeled data. By training an embedding-conditioned Latent Diffusion Model on SSL embeddings, it generates self-augmentations (generative and interpolated) that enrich view diversity without external supervision, and pairs these augmentations with a novel disentanglement pretext task to force the encoder to separate mixed concepts. Empirically, Gen-DINO improves on ImageNet-1K across k-NN, linear probing, retrieval, and video segmentation, and extends to histopathology where gains are observed on PANDA and BRIGHT datasets. The approach also demonstrates robustness to distribution shifts and highlights the importance of conditioning space interpolation over pixel-space methods, offering a practical, domain-agnostic path to stronger SSL representations.

Abstract

Self-supervised learning (SSL) methods have emerged as strong visual representation learners by training an image encoder to maximize similarity between features of different views of the same image. To perform this view-invariance task, current SSL algorithms rely on hand-crafted augmentations such as random cropping and color jittering to create multiple views of an image. Recently, generative diffusion models have been shown to improve SSL by providing a wider range of data augmentations. However, these diffusion models require pre-training on large-scale image-text datasets, which might not be available for many specialized domains like histopathology. In this work, we introduce Gen-SIS, a diffusion-based augmentation technique trained exclusively on unlabeled image data, eliminating any reliance on external sources of supervision such as text captions. We first train an initial SSL encoder on a dataset using only hand-crafted augmentations. We then train a diffusion model conditioned on embeddings from that SSL encoder. Following training, given an embedding of the source image, this diffusion model can synthesize diverse views of that image. We show that these 'self-augmentations', i.e., generative augmentations based on the vanilla SSL encoder embeddings, facilitate the training of a stronger SSL encoder. Furthermore, based on the ability to interpolate between images in the encoder latent space, we introduce the novel pretext task of disentangling the two source images of an interpolated synthetic image. We validate Gen-SIS's effectiveness by demonstrating performance improvements across various downstream tasks on both natural images, which are generally object-centric, and digital histopathology images, which are typically context-based.
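The view-invariance objective described in the abstract can be illustrated with a minimal sketch. This is a generic similarity-based loss on plain Python lists, not the objective used by DINO or Gen-DINO (DINO, for instance, trains with a teacher-student cross-entropy); the function names `cosine_similarity` and `view_invariance_loss` are hypothetical and introduced here only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def view_invariance_loss(feat_view1, feat_view2):
    """Illustrative loss (an assumption, not the paper's): it reaches 0 when
    the features of the two views point in the same direction."""
    return 1.0 - cosine_similarity(feat_view1, feat_view2)
```

In this framing, a generative augmentation simply supplies another view whose features are compared with those of the source image; the encoder is rewarded when both map to similar features.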

Paper Structure

This paper contains 23 sections, 5 equations, 12 figures, 11 tables, 1 algorithm.

Figures (12)

  • Figure 1: (a) Vanilla augmentations used in SSL such as random cropping, color jittering. (b) Generative augmentations (ours) are conditioned on a single source image. (c) Interpolated augmentations (ours) conditioned on a pair of images. In the Gen-SIS framework, we use (b) for view augmentation, and (c) for the disentanglement pretext task, both in conjunction with (a).
  • Figure 2: Overview of the Gen-SIS framework, which contains two key steps: 1) self-augmentation using an embedding-conditioned LDM (E-LDM), and 2) SSL training with augmentations from the E-LDM. $T$ represents vanilla augmentations, $T_{s}$ represents generative augmentation from a single image, and $T_i$ represents interpolated augmentation from two images. Note that vanilla augmentation is applied in conjunction with $T_{s}$ and $T_i$. Pull represents the vanilla SSL pretext task, and Disentangle represents our proposed pretext task with interpolated augmentation.
  • Figure 3: [CLS] token attention maps of DINO and Gen-DINO, averaged across all heads and overlaid on real and interpolated images. Gen-DINO's attention covers a higher portion of object patches than DINO's.
  • Figure 4: Interpolated augmentations ($\alpha \in \{0.2, 0.4, 0.6, 0.8\}$) generated from two real images ($\alpha=0$ and $\alpha=1$). The example shows interpolation between a dog image and a stone image from the ImageNet dataset.
  • Figure 5: Interpolated augmentation using the Gen-SIS framework (ours) vs. pixel-level interpolation. Image 1 and Image 2 are the source images used for interpolation ($\alpha=0.5$).
  • ...and 7 more figures
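
The $\alpha$ sweep in the Figure 4 and Figure 5 captions can be sketched as follows. The code assumes plain linear interpolation, $(1-\alpha)\,z_1 + \alpha\,z_2$, with $\alpha=0$ returning the first source and $\alpha=1$ the second; the captions do not state the exact interpolation rule, so this form is an assumption, and `lerp` and `interpolation_sweep` are hypothetical helper names. Applied to two SSL embeddings, the result would serve as the conditioning vector for the E-LDM (the generation step is not shown); applied to raw pixel values, the same formula gives the pixel-level baseline.

```python
def lerp(a, b, alpha):
    """Linear interpolation between two equal-length vectors.

    alpha=0 returns a, alpha=1 returns b (assumed convention).
    """
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


def interpolation_sweep(emb1, emb2, alphas=(0.2, 0.4, 0.6, 0.8)):
    """Interpolated vectors for each alpha, keyed by alpha.

    The default alphas mirror the values listed in the Figure 4 caption.
    """
    return {alpha: lerp(emb1, emb2, alpha) for alpha in alphas}
```

The only difference between the two settings is the space in which the blend happens: blending pixels overlays the two images, whereas blending embeddings leaves it to the diffusion model to synthesize a single image from the mixed condition.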