Gen-SIS: Generative Self-augmentation Improves Self-supervised Learning

Varun Belagali; Srikar Yellapragada; Alexandros Graikos; Saarthak Kapse; Zilinghan Li; Tarak Nath Nandi; Ravi K Madduri; Prateek Prasanna; Joel Saltz; Dimitris Samaras

TL;DR

Gen-SIS introduces a self-contained diffusion-based data augmentation framework for self-supervised learning that relies solely on unlabeled data. By training an embedding-conditioned Latent Diffusion Model on SSL embeddings, it generates self-augmentations (generative and interpolated) that enrich view diversity without external supervision, and pairs these augmentations with a novel disentanglement pretext task to force the encoder to separate mixed concepts. Empirically, Gen-DINO improves on ImageNet-1K across k-NN, linear probing, retrieval, and video segmentation, and extends to histopathology where gains are observed on PANDA and BRIGHT datasets. The approach also demonstrates robustness to distribution shifts and highlights the importance of conditioning space interpolation over pixel-space methods, offering a practical, domain-agnostic path to stronger SSL representations.

Abstract

Self-supervised learning (SSL) methods have emerged as strong visual representation learners by training an image encoder to maximize similarity between features of different views of the same image. To perform this view-invariance task, current SSL algorithms rely on hand-crafted augmentations such as random cropping and color jittering to create multiple views of an image. Recently, generative diffusion models have been shown to improve SSL by providing a wider range of data augmentations. However, these diffusion models require pre-training on large-scale image-text datasets, which might not be available for many specialized domains like histopathology. In this work, we introduce Gen-SIS, a diffusion-based augmentation technique trained exclusively on unlabeled image data, eliminating any reliance on external sources of supervision such as text captions. We first train an initial SSL encoder on a dataset using only hand-crafted augmentations. We then train a diffusion model conditioned on embeddings from that SSL encoder. Following training, given an embedding of the source image, this diffusion model can synthesize its diverse views. We show that these 'self-augmentations', i.e., generative augmentations based on the vanilla SSL encoder embeddings, facilitate the training of a stronger SSL encoder. Furthermore, based on the ability to interpolate between images in the encoder latent space, we introduce the novel pretext task of disentangling the two source images of an interpolated synthetic image. We validate Gen-SIS's effectiveness by demonstrating performance improvements across various downstream tasks in both natural images, which are generally object-centric, as well as digital histopathology images, which are typically context-based.
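The interpolation and disentanglement ideas in the abstract can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the abstract does not specify how embeddings are mixed or what loss is used, so `interpolate_embeddings` (linear mixing followed by re-normalization) and `disentanglement_loss` (a cosine-similarity proxy) are hypothetical stand-ins, not the paper's formulation. The diffusion model that would turn the mixed embedding into an image is omitted entirely.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def interpolate_embeddings(e_a, e_b, alpha=0.5):
    """Mix two SSL embeddings in the conditioning space.

    Assumption: linear interpolation of unit vectors, then re-normalization.
    The mixed embedding would condition the diffusion model (not shown).
    """
    mixed = (1.0 - alpha) * l2_normalize(e_a) + alpha * l2_normalize(e_b)
    return l2_normalize(mixed)


def disentanglement_loss(z_mix, z_a, z_b):
    """Hypothetical proxy for the disentanglement pretext task.

    z_mix: encoder features of the interpolated synthetic image.
    z_a, z_b: encoder features of the two source images.
    Returns the negative mean cosine similarity to both sources, so a lower
    value means the mixed feature stays closer to both source concepts.
    """
    z_mix, z_a, z_b = (l2_normalize(z) for z in (z_mix, z_a, z_b))
    sim_a = np.sum(z_mix * z_a, axis=-1)
    sim_b = np.sum(z_mix * z_b, axis=-1)
    return float(-0.5 * np.mean(sim_a + sim_b))


# Usage with random stand-in embeddings (batch of 4, dimension 8).
rng = np.random.default_rng(0)
e_a = rng.normal(size=(4, 8))
e_b = rng.normal(size=(4, 8))
e_mix = interpolate_embeddings(e_a, e_b, alpha=0.5)
loss = disentanglement_loss(e_mix, e_a, e_b)
```

In a real pipeline the features passed to the loss would come from the encoder applied to the generated image rather than from the mixed embedding itself; the usage above only exercises the shapes and arithmetic.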
