Do Generated Data Always Help Contrastive Learning?

Yifei Wang, Jizhe Zhang, Yisen Wang

TL;DR

This work shows that data inflation and data augmentation play complementary roles in contrastive learning: stronger data inflation should be accompanied by weaker augmentations, and vice versa. It proposes Adaptive Inflation (AdaInf), a purely data-centric strategy that introduces no extra computation cost.

Abstract

Contrastive Learning (CL) has emerged as one of the most successful paradigms for unsupervised visual representation learning, yet it often depends on intensive manual data augmentations. With the rise of generative models, especially diffusion models, the ability to generate realistic images close to the real data distribution has been well recognized. These high-quality generated images have been successfully applied to enhance contrastive representation learning, a technique termed "data inflation". However, we find that generated data (even from a good diffusion model like DDPM) may sometimes harm contrastive learning. We investigate the causes of this failure from the perspectives of both data inflation and data augmentation. For the first time, we reveal that the two play complementary roles: stronger data inflation should be accompanied by weaker augmentations, and vice versa. We also provide rigorous theoretical explanations for these phenomena by deriving generalization bounds for contrastive learning under data inflation. Drawing on these insights, we propose Adaptive Inflation (AdaInf), a purely data-centric strategy that introduces no extra computation cost. On benchmark datasets, AdaInf brings significant improvements for various contrastive learning methods. Notably, without using external data, AdaInf obtains 94.70% linear accuracy on CIFAR-10 with SimCLR, setting a new record that surpasses many sophisticated methods. Code is available at https://github.com/PKU-ML/adainf.

Paper Structure (19 sections, 4 theorems, 13 equations, 13 figures, 6 tables)

Key Result

Theorem 3.1

$\mathrm{D_{TV}}(P_t,P_d)=(1-\beta)\,\mathrm{D_{TV}}(P_g,P_d)$, where $\mathrm{D_{TV}}$ denotes the total variation (TV) distance.
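
The identity above can be checked numerically. The following is a minimal sketch, not the paper's code. It assumes (the excerpt does not define the symbols) that $P_d$ is the real data distribution, $P_g$ the generated data distribution, and $P_t=\beta P_d+(1-\beta)P_g$ the inflated training distribution with real-data share $\beta$. It also assumes a small discrete support so the TV distance is a finite sum. All function names are hypothetical.

```python
import random


def tv_distance(p, q):
    """Total variation distance between two discrete distributions on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def random_distribution(k, rng):
    """Draw a random probability vector of length k."""
    weights = [rng.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def mixture(p_d, p_g, beta):
    """Assumed training distribution: beta * P_d + (1 - beta) * P_g."""
    return [beta * a + (1.0 - beta) * b for a, b in zip(p_d, p_g)]


rng = random.Random(0)
p_d = random_distribution(10, rng)  # stand-in for the real data distribution
p_g = random_distribution(10, rng)  # stand-in for the generated data distribution
beta = 0.3                          # assumed share of real data in the mixture

p_t = mixture(p_d, p_g, beta)
lhs = tv_distance(p_t, p_d)
rhs = (1.0 - beta) * tv_distance(p_g, p_d)

# The two sides agree up to floating-point error.
print(abs(lhs - rhs) < 1e-12)
```

Under this reading, the check works because $P_t-P_d=(1-\beta)(P_g-P_d)$ pointwise, so raising the share of real data $\beta$ shrinks the gap between the training distribution and the real distribution linearly.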

Figures (13)

  • Figure 1: (a) During data inflation, the real data and the generated data (usually larger in size) are combined as the training data for contrastive learning, where two random augmentations are drawn from each sample to compute the contrastive loss. (b) Linear accuracy of contrastive learning (SimCLR [simclr]) with different data inflation strategies on CIFAR-10. The generated data are 1M samples drawn from DDPM (FID 3.04) or STF (FID 1.94).
  • Figure 2: Performance of contrastive learning with 1M generated data on CIFAR-10. (a) Linear accuracy using four diffusion models for generation: DDPM [ho2020denoising], EDM [Karras2022edm], and STF [stf] (the two EDM models differ in their training time). (b) Linear accuracy with data reweighting (real data : generated data $=N:1$).
  • Figure 3: (a) Linear accuracy with different augmentation strengths, obtained by changing the min scale of random resized cropping (a lower value represents stronger augmentation). (b) Linear accuracy of different inflation strategies on CIFAR-10 (with $10:1$ data reweighting).
  • Figure 4: Illustrative examples of the effect of data augmentation on (a) the labeling error (i.e., augmented samples belonging to different classes) and (b) graph connectivity (i.e., different samples bridged together after augmentation).
  • Figure 5: Analysis of the influence of data size (i.e., inflation) and augmentation strength ($r$) on two crucial factors in the generalization error, the labeling error $\alpha$ and the graph connectivity $\lambda_{k+1}$, on the synthetic dataset (see the paper's synthetic-data section). The optimal augmentation strengths are marked with red dots.
  • ...and 8 more figures
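
The two knobs named in the captions of Figures 2 and 3, data reweighting and the min scale of random resized cropping, can be illustrated with a small sketch. This is not the AdaInf implementation from the paper's repository. It assumes that "real data : generated data $=N:1$" means each real sample is replicated $N$ times while each generated sample appears once, and that the crop's area fraction is drawn uniformly from `[min_scale, 1]`. The helper names `inflate_indices` and `sample_crop_scale` are hypothetical.

```python
import random


def inflate_indices(n_real, n_generated, replicate):
    """Build the inflated training index.

    Each real sample is repeated `replicate` times (assumed meaning of N:1
    reweighting); each generated sample appears once.
    """
    real = [("real", i) for i in range(n_real) for _ in range(replicate)]
    generated = [("generated", j) for j in range(n_generated)]
    return real + generated


def sample_crop_scale(min_scale, rng):
    """Draw the cropped-area fraction for a random resized crop.

    A lower `min_scale` allows smaller crops, i.e. stronger augmentation.
    """
    return rng.uniform(min_scale, 1.0)


# Toy sizes for illustration only: 5 real samples, 100 generated, 10:1 reweighting.
index = inflate_indices(n_real=5, n_generated=100, replicate=10)
n_real_entries = sum(1 for source, _ in index if source == "real")
print(len(index), n_real_entries)

rng = random.Random(0)
weak_aug_scale = sample_crop_scale(min_scale=0.2, rng=rng)     # weaker augmentation
strong_aug_scale = sample_crop_scale(min_scale=0.08, rng=rng)  # stronger augmentation
```

In this toy setting, replication raises the real-data share of the training index from 5/105 to 50/150. The sketch only shows how the two settings are expressed, not which values work best.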

Theorems & Definitions (6)

  • Theorem 3.1
  • Theorem 4.1
  • Lemma 4.2: Theorem 1.1 in chung2007spectral
  • proof
  • proof
  • Lemma D.1: Theorem B.3 in haochen