
Few-Shot Diffusion Models

Giorgio Giannone, Didrik Nielsen, Ole Winther

TL;DR

Few-Shot Diffusion Models (FSDM) are a set-conditioned diffusion framework: a ViT-based set encoder maps a small support set X to a context c, which guides a conditional DDPM. The context is fused into the diffusion path through FiLM or cross-attention. This enables high-quality few-shot generation and transfer to new datasets, with faster convergence and better conditioning than unconditional and conditional DDPM baselines. The approach is validated across several datasets, showing better denoising and generation metrics, effective cross-dataset transfer, and favorable comparisons to test-time conditioning methods. A variational extension (VFSDM) is explored but found harder to train, which points to the practical strength of deterministic set-based conditioning. Overall, FSDM is a scalable, expressive method for rapidly adapting diffusion models to novel concepts from limited examples, and its patch-based tokenization makes it applicable to other modalities.
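
The FiLM fusion mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `film`, the projection matrices `W_gamma` and `W_beta`, and all shapes are assumptions made for illustration, and the sketch omits the diffusion timestep entirely.

```python
import numpy as np

def film(h, c, W_gamma, W_beta):
    """Feature-wise linear modulation of feature maps h by a context c.

    h:       (batch, channels, height, width) feature maps
    c:       (batch, context_dim) context vectors
    W_gamma: (context_dim, channels) hypothetical scale projection
    W_beta:  (context_dim, channels) hypothetical shift projection
    """
    gamma = c @ W_gamma  # per-channel scale, shape (batch, channels)
    beta = c @ W_beta    # per-channel shift, shape (batch, channels)
    # Broadcast scale and shift over the spatial dimensions.
    return gamma[:, :, None, None] * h + beta[:, :, None, None]

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 4, 8, 8))   # toy feature maps
c = rng.normal(size=(2, 16))        # toy context vectors
W_gamma = rng.normal(size=(16, 4))
W_beta = rng.normal(size=(16, 4))
print(film(h, c, W_gamma, W_beta).shape)  # (2, 4, 8, 8)
```

Each channel is scaled and shifted by values computed from the context alone, so the same support set modulates every spatial position identically.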

Abstract

Denoising diffusion probabilistic models (DDPM) are powerful hierarchical latent variable models with remarkable sample generation quality and training stability. These properties can be attributed to parameter sharing in the generative hierarchy, as well as a parameter-free diffusion-based inference procedure. In this paper, we present Few-Shot Diffusion Models (FSDM), a framework for few-shot generation leveraging conditional DDPMs. FSDMs are trained to adapt the generative process conditioned on a small set of images from a given class by aggregating image patch information using a set-based Vision Transformer (ViT). At test time, the model is able to generate samples from previously unseen classes conditioned on as few as 5 samples from that class. We empirically show that FSDM can perform few-shot generation and transfer to new datasets. We benchmark variants of our method on complex vision datasets for few-shot learning and compare to unconditional and conditional DDPM baselines. Additionally, we show how conditioning the model on patch-based input set information improves training convergence.
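
As a rough illustration of aggregating patch information from a small set of images, the sketch below splits each image into non-overlapping patches and pools the embedded patches into one context vector. The mean pooling is a deliberately simple stand-in for the set-based ViT, not the paper's encoder; `patchify_set`, `mean_context`, `W_embed`, and the chosen sizes are all hypothetical.

```python
import numpy as np

def patchify_set(X, patch):
    """Split a set of images into non-overlapping, flattened patches.

    X: (set_size, height, width, channels)
    Returns an array of shape (set_size * num_patches, patch * patch * channels).
    """
    n, H, W, C = X.shape
    assert H % patch == 0 and W % patch == 0, "image size must be divisible by patch"
    gh, gw = H // patch, W // patch
    x = X.reshape(n, gh, patch, gw, patch, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (n, gh, gw, patch, patch, C)
    return x.reshape(n * gh * gw, patch * patch * C)

def mean_context(tokens, W_embed):
    """Embed tokens linearly and average them into one context vector.

    Mean pooling is an assumption standing in for a transformer encoder.
    """
    return (tokens @ W_embed).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 32, 32, 3))   # a set of 5 toy images
tokens = patchify_set(X, patch=8)     # 5 images * 16 patches each
W_embed = rng.normal(size=(8 * 8 * 3, 64))
c = mean_context(tokens, W_embed)
print(tokens.shape, c.shape)  # (80, 192) (64,)
```

Because the pooled context is an average over all patches, it does not depend on the order of the images in the set.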

Paper Structure (28 sections, 9 equations, 10 figures, 4 tables)


Figures (10)

  • Figure 1: Set (left) and conditional samples (right) on CIFAR100 using a Few-Shot Diffusion Model. FSDM can extract content information from a handful of realistic examples and generate rich and complex samples from a variety of conditional distributions. More samples in Appendix Fig. \ref{fig:intro_samples_cifar100_app}.
  • Figure 2: Estimated $L_{\epsilon}$ per layer on CIFAR100 during training. FSDM is data efficient during training and can denoise the data better and faster than unconditional and conditional DDPM baselines.
  • Figure 3: Few-Shot Diffusion Models.
  • Figure 4: sViT architecture. The input is a set $\mathbf{X}$ of images. These are split into non-overlapping patches and fed to a transformer encoder using a shared positional encoding, as indicated by the patch colors. The sViT outputs a context as a vector (V) or collection of visual tokens (T). The DDPM is conditioned on this information using FiLM or attention.
  • Figure 5: Few-shot conditional samples on CIFAR100 using an FSDM. Left: conditioning set and samples from in-distribution classes; right: conditioning set and samples from out-of-distribution classes. More samples at higher resolution in Appendix Fig. \ref{fig:conditional_samples_cifar100_app} and Fig. \ref{fig:few_shot_samples_cifar100_app}.
  • ...and 5 more figures
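
The token-based (T) conditioning path described in the Figure 4 caption can be sketched as single-head cross-attention, where DDPM features act as queries and the visual tokens as keys and values. This is a generic scaled dot-product attention sketch under assumed shapes, not the paper's layer; `cross_attention` and the projection matrices `W_q`, `W_k`, `W_v` are hypothetical names.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(h, tokens, W_q, W_k, W_v):
    """Single-head cross-attention from features to context tokens.

    h:      (num_positions, d_h) flattened DDPM features (queries)
    tokens: (num_tokens, d_c) context tokens (keys and values)
    Returns the attended features and the attention weights.
    """
    q = h @ W_q       # (num_positions, d_k)
    k = tokens @ W_k  # (num_tokens, d_k)
    v = tokens @ W_v  # (num_tokens, d_out)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
h = rng.normal(size=(64, 32))        # 8x8 feature map flattened to 64 positions
tokens = rng.normal(size=(80, 48))   # toy context tokens
W_q = rng.normal(size=(32, 16))
W_k = rng.normal(size=(48, 16))
W_v = rng.normal(size=(48, 32))
out, weights = cross_attention(h, tokens, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (64, 32) (64, 80)
```

Unlike a pooled context vector, this lets each spatial position of the feature map weight the context tokens differently.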