Conditional Diffusion on Web-Scale Image Pairs leads to Diverse Image Variations

Manoj Kumar, Neil Houlsby, Emiel Hoogeboom

TL;DR

A diffusion model trained to reconstruct an input image from frozen embeddings is shown to reproduce the image with only minor variations. The paper also explores a new pretraining strategy that learns to generate image variations from a large collection of image pairs.

Abstract

Generating image variations, where a model produces variations of an input image while preserving the semantic context, has gained increasing attention. Current image variation techniques involve adapting a text-to-image model to reconstruct an input image conditioned on the same image. We first demonstrate that a diffusion model trained to reconstruct an input image from frozen embeddings can reconstruct the image with only minor variations. Second, inspired by how text-to-image models learn from web-scale text-image pairs, we explore a new pretraining strategy to generate image variations using a large collection of image pairs. Our diffusion model, Semantica, receives a random (encoded) image from a webpage as conditional input and denoises another noisy random image from the same webpage. We carefully examine various design choices for the image encoder, given its crucial role in extracting relevant context from the input image. Once trained, Semantica can adaptively generate new images from a dataset by simply using images from that dataset as input. Finally, we identify limitations in standard image consistency metrics for evaluating image variations and propose alternative metrics based on few-shot generation.
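
The pair-based training setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`group_by_page`, `sample_pair`, `add_noise`), the flat-list image representation, and the cosine noise schedule are all assumptions made for illustration. The sketch only shows the idea of drawing a conditioning image and a different target image from the same webpage, then noising the target.

```python
import math
import random
from collections import defaultdict


def group_by_page(records):
    """Group image ids by source webpage (hypothetical record format:
    (page_url, image_id)). Pages with fewer than two images are dropped,
    since they cannot yield a (conditioning, target) pair."""
    pages = defaultdict(list)
    for page_url, image_id in records:
        pages[page_url].append(image_id)
    return {url: ids for url, ids in pages.items() if len(ids) >= 2}


def sample_pair(pages, rng):
    """Pick a random page, then two distinct images from that page.
    The first is the conditioning image, the second is the target."""
    url = rng.choice(sorted(pages))
    cond_id, target_id = rng.sample(pages[url], 2)
    return url, cond_id, target_id


def add_noise(pixels, t, rng):
    """Noise the target image at time t in [0, 1].
    Assumption: a cosine schedule x_t = alpha * x + sigma * eps with
    alpha = cos(pi * t / 2) and sigma = sin(pi * t / 2); the schedule
    used in the paper is not stated in this section."""
    alpha = math.cos(0.5 * math.pi * t)
    sigma = math.sin(0.5 * math.pi * t)
    eps = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [alpha * x + sigma * e for x, e in zip(pixels, eps)]
    return noisy, eps


if __name__ == "__main__":
    rng = random.Random(0)
    records = [
        ("a.example", "a1"), ("a.example", "a2"),
        ("b.example", "b1"),
        ("c.example", "c1"), ("c.example", "c2"), ("c.example", "c3"),
    ]
    pages = group_by_page(records)
    url, cond_id, target_id = sample_pair(pages, rng)
    # In a real pipeline the conditioning image would go through the image
    # encoder, and the denoiser would be trained on the noisy target.
    noisy_target, eps = add_noise([0.5, -0.5, 0.1], t=rng.random(), rng=rng)
    print(url, cond_id, target_id, noisy_target)
```

In this sketch the conditioning and target images are always distinct, which is the property that separates pair-based training from reconstructing an image conditioned on itself.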

Paper Structure (30 sections, 3 equations, 12 figures, 5 tables)

Figures (12)

  • Figure 1: Each grid presents a conditioning image at the top, followed by $512 \times 512$ image variations generated by Semantica, IP-Adapter, and SDv2 IV. Samples generated by a semantic image-variation model should maintain semantic consistency with the conditioning image while also being sufficiently diverse. Semantica demonstrates greater diversity than IP-Adapter while preserving semantic context. While SDv2 IV generates diverse outputs, they are often not congruent with the conditioning image. Additional samples are provided in the appendix.
  • Figure 2: A conditional diffusion model reconstructs images from frozen DINOv2 embeddings. Left: Input Images. Right: Three samples from the trained diffusion model with guidance 0.0 exhibiting low-level variation but lacking high-level variation.
  • Figure 3: Comparison of Semantica against three state-of-the-art image variation baselines on one-shot ImageNet, using FID (left table) and precision-recall (right plot) as evaluation metrics. Each point in the right plot represents a different guidance factor. Semantica outperforms the image-variation baselines, achieving lower FID and a better precision-recall tradeoff.
  • Figure 4: We present additional samples and comparisons on ImageNet. Samples from Semantica reflect diversity while being congruent with the conditioning image.
  • Figure 5: A conditional diffusion model reconstructs images from frozen SigLIP embeddings. As with the frozen DINOv2 embeddings in Fig. 2, the generated samples exhibit only very minor low-level variations.
  • ...and 7 more figures