GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning

Xiaojie Li, Yibo Yang, Xiangtai Li, Jianlong Wu, Yue Yu, Bernard Ghanem, Min Zhang

TL;DR

GenView tackles the limitations of traditional positive-view construction in self-supervised learning by enabling controllable, semantically faithful diversity through a pretrained image-conditioned diffusion model. It introduces adaptive view generation, which uses the foreground content of an image to set the diffusion perturbation level $l_i^{ada}$, and a quality-driven contrastive loss, which reweights each positive pair using a quality score $q_i = s_i^f - s_i^b$ (foreground similarity minus background similarity) and a weight $w_i$ proportional to the softmax of $q_i$. It improves MoCov2, MoCov3, BYOL, and other SSL baselines on ImageNet linear and semi-supervised tasks, often surpassing naive data expansion such as adding Laion400M or IN-21K with a modest synthetic-data budget, and it also boosts downstream MS-COCO detection/segmentation. Overall, GenView shows that controlled, high-quality generative view construction can meaningfully enhance representation learning across tasks without requiring massive labeled or unlabeled data.

Abstract

Self-supervised learning has achieved remarkable success in acquiring high-quality representations from unlabeled data. The widely adopted contrastive learning framework aims to learn invariant representations by minimizing the distance between positive views originating from the same image. However, existing techniques for constructing positive views rely heavily on manual transformations, resulting in limited diversity and potentially false positive pairs. To tackle these challenges, we present GenView, a controllable framework that augments the diversity of positive views by leveraging pretrained generative models while preserving semantics. We develop an adaptive view generation method that dynamically adjusts the noise level in sampling to preserve essential semantic meaning while introducing variability. Additionally, we introduce a quality-driven contrastive loss, which assesses the quality of positive pairs by considering both foreground similarity and background diversity. This loss prioritizes the high-quality positive pairs we construct while reducing the influence of low-quality pairs, thereby mitigating potential semantic inconsistencies introduced by generative models and aggressive data augmentation. Thanks to the improved positive view quality and the quality-driven contrastive loss, GenView significantly improves self-supervised learning across various tasks. For instance, GenView improves MoCov2 performance by 2.5%/2.2% on ImageNet linear/semi-supervised classification. Moreover, GenView even performs much better than naively augmenting the ImageNet dataset with Laion400M or ImageNet21K. Code: https://github.com/xiaojieli0903/genview.
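
The quality-driven reweighting summarized in the TL;DR can be sketched in a few lines. This is a minimal illustration, not the released implementation: the function names, the `temperature` parameter, and the choice to scale the softmax by the batch size (so that the mean weight is 1) are assumptions made here for clarity. The inputs are per-pair foreground and background similarity scores, corresponding to $s_i^f$ and $s_i^b$.

```python
import math


def quality_weights(fg_sims, bg_sims, temperature=1.0):
    """Per-pair weights from q_i = s_i^f - s_i^b via a softmax over the batch.

    Assumption: the softmax is scaled by the batch size so the weights
    average to 1; the source only states that w_i is proportional to
    the softmax of q_i.
    """
    if len(fg_sims) != len(bg_sims) or not fg_sims:
        raise ValueError("need two non-empty score lists of equal length")
    # Quality score: high foreground similarity, low background similarity.
    q = [f - b for f, b in zip(fg_sims, bg_sims)]
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q)
    exps = [math.exp((qi - q_max) / temperature) for qi in q]
    total = sum(exps)
    n = len(q)
    return [n * e / total for e in exps]


def reweighted_loss(per_pair_losses, weights):
    """Mean of the per-pair contrastive losses, each scaled by its weight."""
    if len(per_pair_losses) != len(weights):
        raise ValueError("losses and weights must have equal length")
    return sum(w * l for w, l in zip(weights, per_pair_losses)) / len(weights)


if __name__ == "__main__":
    # Pair 0 keeps the subject and varies the background; pair 1 does not.
    w = quality_weights([0.9, 0.5], [0.1, 0.5])
    print(w, reweighted_loss([1.2, 0.8], w))
```

With equal quality scores every weight is 1 and the loss reduces to the ordinary batch mean, so under these assumptions the reweighting only changes the relative emphasis among pairs.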

Paper Structure (32 sections, 14 equations, 5 figures, 10 tables, 1 algorithm)

Figures (5)

  • Figure 1: The motivation of GenView: (a) and (b) show standard augmentation-based positive pairs, while (c) and (d) are GenView-constructed pairs. Standard augmentations may produce a false positive pair (a) or a less diverse pair (b). In comparison, GenView preserves subject semantics while introducing variations (c and d) and assesses pair quality to guide contrastive learning.
  • Figure 2: GenView is composed of a view quality enhancement framework, an adaptive view generation method to balance diversity and semantic fidelity, and a quality-driven contrastive loss mechanism. The framework generates the enhanced view by passing the noisy image embedding, which is extracted by the frozen CLIP encoder, to the image-conditioned pretrained generative model (the Stable Diffusion generator). Positive views are passed through encoders to compute the contrastive loss, with an emphasis on high-quality positive pairs. The encoders $f$ can be the same encoder or different ones, e.g. an encoder and its momentum-updated counterpart. Neither the pretrained CLIP encoder nor Stable Diffusion has accessed the dataset used for SSL.
  • Figure 3: Illustration of our adaptive view generation. For images with a lower foreground proportion, a lower noise level is selected (in blue) because a higher noise level could easily result in synthetic images whose semantic content is changed (1st column), lost (2nd column), or distorted (3rd column). For images with a higher foreground proportion, a higher noise level is favored (in green) to introduce diversity, e.g. a different pose (4th column), action (5th column), or background (6th column).
  • Figure 4: Positive pairs of views constructed by GenView, conditioned on images from IN-1K and CF10.
  • Figure 5: Visualization of positive pairs generated by GenView, depicting successful variations and failure cases (outlined in red).
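
The rule illustrated in Figure 3 (a lower foreground proportion selects a lower noise level, a higher proportion a higher one) can be sketched as a monotone binning function. This is only an illustration: the function name, the equal-width bins, and the default level values are hypothetical placeholders, not settings confirmed by the text above, and the foreground proportion is assumed to be already available as a number in [0, 1].

```python
def adaptive_noise_level(fg_proportion, levels=(0, 100, 200, 300, 400)):
    """Map a foreground proportion in [0, 1] to a noise level.

    `levels` must be sorted in increasing order; its default values are
    hypothetical. Equal-width bins are an assumption of this sketch.
    """
    if not 0.0 <= fg_proportion <= 1.0:
        raise ValueError("fg_proportion must lie in [0, 1]")
    if not levels:
        raise ValueError("levels must be non-empty")
    # Equal-width bins over [0, 1]; a proportion of exactly 1.0 falls in the last bin.
    index = min(int(fg_proportion * len(levels)), len(levels) - 1)
    return levels[index]


if __name__ == "__main__":
    for p in (0.05, 0.5, 0.95):
        print(p, adaptive_noise_level(p))
```

Because the mapping is monotone, an image dominated by its subject never receives less perturbation than one where the subject is small, which matches the trade-off between diversity and semantic fidelity that the caption describes.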