Generative Models as a Data Source for Multiview Representation Learning

Ali Jahanian, Xavier Puig, Yonglong Tian, Phillip Isola

TL;DR

This work investigates learning visual representations using only samples from a pre-trained, black-box generative model, without access to its training data. It adapts contrastive learning to create multiple views via latent-space perturbations in addition to standard pixel-space augmentations, and compares against non-contrastive baselines. Latent views, particularly Gaussian perturbations, significantly boost transfer performance and, with high-quality generators such as StyleGAN2, can even rival representations learned from real data. The results also reveal an inverted-U effect for the magnitude of the latent perturbation and sub-logarithmic gains as the number of synthetic samples grows. The study suggests that generative models can serve as compressed, privacy-friendly data sources for representation learning, and it outlines practical guidelines for choosing sampling strategies and training methods in that setting.

Abstract

Generative models are now capable of producing highly realistic images that look nearly indistinguishable from the data on which they are trained. This raises the question: if we have good enough generative models, do we still need datasets? We investigate this question in the setting of learning general-purpose visual representations from a black-box generative model rather than directly from data. Given an off-the-shelf image generator without any access to its training data, we train representations from the samples output by this generator. We compare several representation learning methods that can be applied to this setting, using the latent space of the generator to generate multiple "views" of the same semantic content. We show that for contrastive methods, this multiview data can naturally be used to identify positive pairs (nearby in latent space) and negative pairs (far apart in latent space). We find that the resulting representations rival or even outperform those learned directly from real data, but that good performance requires care in the sampling strategy applied and the training method. Generative models can be viewed as a compressed and organized copy of a dataset, and we envision a future where more and more "model zoos" proliferate while datasets become increasingly unwieldy, missing, or private. This paper suggests several techniques for dealing with visual representation learning in such a future. Code is available on our project page https://ali-design.github.io/GenRep/.
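The positive/negative pairing described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: `toy_generator` and `toy_encoder` are hypothetical stand-ins for the black-box generator $G$ and the embedding function $F$, the batch size, latent dimension, `sigma`, and `temperature` values are arbitrary, and the loss is a standard InfoNCE objective assumed here as a representative contrastive loss. A positive pair is formed by adding Gaussian noise to a latent vector; independently sampled latents in the same batch act as negatives.

```python
import math
import random


def sample_latent(dim, rng):
    """Draw z ~ N(0, I), a common latent prior (assumed here)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def gaussian_latent_view(z, sigma, rng):
    """Latent transformation T_z: z -> z + w, with w ~ N(0, sigma^2 I)."""
    return [zi + rng.gauss(0.0, sigma) for zi in z]


def toy_generator(z):
    """Hypothetical stand-in for the black-box generator G (a real G returns an image)."""
    return [math.tanh(zi) for zi in z]


def toy_encoder(x):
    """Hypothetical stand-in for the embedding F: L2-normalise the input."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: positives[i] pairs with anchors[i]; every other
    positives[j] in the batch serves as a negative for anchors[i]."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(u * v for u, v in zip(a, p)) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def make_batch(batch_size, dim, sigma, rng):
    """Build embeddings for (anchor, positive) pairs from nearby latent vectors."""
    anchors, positives = [], []
    for _ in range(batch_size):
        z = sample_latent(dim, rng)
        z_view = gaussian_latent_view(z, sigma, rng)
        anchors.append(toy_encoder(toy_generator(z)))
        positives.append(toy_encoder(toy_generator(z_view)))
    return anchors, positives


rng = random.Random(0)
anchors, positives = make_batch(batch_size=8, dim=16, sigma=0.2, rng=rng)
loss = info_nce(anchors, positives)
print(f"InfoNCE loss on toy latent views: {loss:.4f}")
```

In a real pipeline the encoder would be a trainable network and this loss would be minimised by gradient descent; `sigma` controls how far apart the two latent views are, which is the quantity the paper varies when studying the effect of view distance.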

Paper Structure

This paper contains 39 sections, 8 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Visual representation learning typically consists of training an image embedding function, $F:\mathbf{x} \rightarrow \mathbf{e}$, given a dataset of real images $\{\mathbf{x}_i\}^{N}_{i=1}$ (left panel). In our work (right panel), we study how to learn representations given instead a black-box generative model $G$. Generative models allow us to sample continuous streams of synthetic data. By applying transformations $T_\mathbf{z}$ on the latent vectors $\mathbf{z}$ of the model, we can create multiple data "views" that can serve as effective training data for representation learners.
  • Figure 2: Different ways of creating multiple views of the same "content". (a) SimCLR [chen2020simple] creates views by transforming an input image with standard pixel-space ($\mathcal{X}$) data augmentations (example images taken from [chen2020simple]). (b) With a generative model, we can instead create views by sampling nearby points in latent space $\mathcal{Z}$, exploiting the fact that nearby points in latent space tend to generate imagery of the same semantic object. Note that these examples are illustrative; the actual transformations that achieve the best results are shown in Fig. \ref{fig:biggan_samples}.
  • Figure 3: Three different methods for learning representations. The first row illustrates a standard contrastive learning framework (e.g., SimCLR [chen2020simple]) in which positive pairs are sampled as transformations of training images $\mathbf{x}$. The second and third rows show the new setting we consider: we are given a generator, rather than a dataset, and can use the latent space (input) of the generator to control the production of effective training data. $T_{\mathbf{x}}$ refers to transformations applied in pixel space and $T_{\mathbf{z}}$ denotes transformations in latent space. The second row illustrates a contrastive learning approach in this setting and the third row shows an approach that simply inverts the generative model. For contrastive learning, negatives are omitted for clarity.
  • Figure 4: Examples of different transformation methods for unconditional IGM data. The top row shows samples from BigBiGAN trained on ImageNet1000, and the bottom row shows samples from StyleGAN2 trained on LSUN Car.
  • Figure 5: Effect of the distance between latent views on contrastive learning. We vary the standard deviation of a Gaussian $T_{\mathbf{z}}=\mathbf{z} + \mathbf{w}_{\texttt{Gauss}}$ and measure linear transfer to ImageNet100.
  • ...and 6 more figures