Contrastive Learning with Synthetic Positives

Dewen Zeng, Yawen Wu, Xinrong Hu, Xiaowei Xu, Yiyu Shi

TL;DR

This work addresses the limited diversity of positives in contrastive self-supervised learning: nearest-neighbor methods mostly retrieve easy positives, so the authors add synthetic positives generated by an unconditional diffusion model. The proposed method, CLSP, uses feature interpolation $h = w \cdot h + (1-w) \cdot h_{anchor}$ to produce hard positives $x_i^3$ and augments the standard loss as $L = \sum_{i,k} L_{i,k} + \lambda \sum_i (z_i^2 - z_i^3)^2$, using a pre-generated candidate set of size $k \le 8$. Empirically, CLSP variants outperform strong baselines on CIFAR10/100, STL10, and ImageNet100 in linear and transfer evaluations, with gains of around 2.9 percentage points on CIFAR10 and up to about 6.2 points on CIFAR100, and improvements on 6 of 8 downstream transfer datasets. The results position diffusion-guided synthetic positives as a robust baseline for diffusion-assisted SSL and encourage scaling to larger datasets and broader tasks.
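The two formulas in the summary above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, array shapes, and the reading of $(z_i^2 - z_i^3)^2$ as a squared Euclidean distance between embedding vectors are assumptions made for illustration, and the base contrastive term $\sum_{i,k} L_{i,k}$ is left out because it is not defined in this summary.

```python
import numpy as np


def interpolate_features(h, h_anchor, w):
    """Blend a sampled feature map with the anchor's: w * h + (1 - w) * h_anchor.

    `h` and `h_anchor` are hypothetical arrays of identical shape standing in
    for intermediate diffusion-model features; `w` is the interpolation weight.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * h + (1.0 - w) * h_anchor


def synthetic_positive_penalty(z2, z3, lam):
    """Extra loss term: lam * sum_i ||z_i^2 - z_i^3||^2.

    `z2` and `z3` hold one embedding per row (second view and synthetic
    positive). Treating the square as a squared L2 norm is an assumption.
    """
    diff = z2 - z3
    return lam * float(np.sum(diff * diff))


# Toy inputs, random values only; no real features or embeddings are used.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
h_anchor = rng.normal(size=(4, 8))
h_mix = interpolate_features(h, h_anchor, w=0.3)

z2 = rng.normal(size=(4, 16))
z3 = rng.normal(size=(4, 16))
penalty = synthetic_positive_penalty(z2, z3, lam=0.5)
```

With $w = 1$ the interpolation returns the sampled features unchanged, and with $w = 0$ it returns the anchor's features, so $w$ controls how much anchor content is injected into the blend.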

Abstract

Contrastive learning with the nearest neighbor has proved to be one of the most efficient self-supervised learning (SSL) techniques by utilizing the similarity of multiple instances within the same class. However, its efficacy is constrained because the nearest neighbor algorithm primarily identifies "easy" positive pairs, whose representations are already close together in the embedding space. In this paper, we introduce a novel approach called Contrastive Learning with Synthetic Positives (CLSP) that uses synthetic images, generated by an unconditional diffusion model, as additional positives to help the model learn from diverse positives. Through feature interpolation in the diffusion model sampling process, we generate images whose backgrounds differ from the anchor image but whose semantic content is similar. These images are considered "hard" positives for the anchor image, and when included as supplementary positives in the contrastive loss, they improve linear evaluation performance by over 2% and 1% compared to the previous NNCLR and All4One methods, respectively, across multiple benchmark datasets such as CIFAR10, achieving state-of-the-art performance. On transfer learning benchmarks, CLSP outperforms existing SSL frameworks on 6 out of 8 downstream datasets. We believe CLSP establishes a valuable baseline for future SSL studies incorporating synthetic data in the training process.

Paper Structure

This paper contains 16 sections, 3 equations, 8 figures, and 11 tables.

Figures (8)

  • Figure 1: (a) Overview of the proposed CLSP framework. We use a diffusion model to generate an additional positive $x_i^3$, which increases positive diversity for better representation learning. (b) The t-SNE plot of features extracted from the intermediate layer of the diffusion model trained on CIFAR10. The features are generated at timestep 50. (c) The generated images contain only background information if intermediate features are masked, suggesting that semantic and background information are decoupled across different layers of the diffusion model. (d) Using feature interpolation to generate hard positives: the generated images contain semantic content similar to the anchor image but differ in context and background.
  • Figure 2: Feature similarity of original positive pairs (left) and additional positive pairs (right) on CIFAR10. The original positive pair is the two augmented views, and the additional positive pair is one of the augmented views with the synthetic positive.
  • Figure 3: The correlation of image generation quality with CLSP-SimCLR and CLSP-MoCoV2 linear evaluation performance on CIFAR10 and CIFAR100 datasets.
  • Figure 4: (a) Generated images under different feature interpolation weights. (b) The correlation of linear classification accuracy with $w$ on the CIFAR10 and CIFAR100 datasets.
  • Figure A.1: Alternative positive generation methods. (a) RCG, which uses a pre-trained SSL encoder to generate a semantic embedding of the anchor image as the condition that guides diffusion sampling. (b) RCG-cluster, which uses an unsupervised cluster head to cluster the embeddings in RCG and then uses the cluster output as the condition.
  • ...and 3 more figures
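
The pair-wise feature similarity described for Figure 2 can be sketched as follows. The captions do not say which similarity measure is plotted, so cosine similarity is an assumption here, and `pairwise_cosine` and the toy arrays are hypothetical names and values rather than anything taken from the paper.

```python
import numpy as np


def pairwise_cosine(a, b, eps=1e-12):
    """Row-wise cosine similarity between two batches of feature vectors.

    Row i of `a` is compared with row i of `b`, e.g. an augmented view and
    its paired positive. `eps` guards against division by zero.
    """
    a_norm = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b_norm = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    return np.sum(a_norm * b_norm, axis=1)


# Toy features, random values only: a view, a lightly perturbed copy
# (standing in for an "easy" pair), and unrelated vectors for contrast.
rng = np.random.default_rng(0)
view_1 = rng.normal(size=(5, 32))
view_2 = view_1 + 0.1 * rng.normal(size=(5, 32))
other = rng.normal(size=(5, 32))

sim_easy = pairwise_cosine(view_1, view_2)
sim_other = pairwise_cosine(view_1, other)
```

Comparing the distribution of such scores for the two augmented views against the scores for a view paired with its synthetic positive is one way to check whether the synthetic positives are in fact "harder", meaning less similar on average.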