Few-Shot Diffusion Models
Giorgio Giannone, Didrik Nielsen, Ole Winther
TL;DR
Few-Shot Diffusion Models (FSDM) are a set-conditioned diffusion framework: a ViT-based set encoder maps a small support set X to a context c, which then guides a conditional DDPM. The context c is fused into the diffusion path through FiLM or cross-attention, enabling high-quality few-shot generation and transfer to new datasets, with faster convergence and more effective conditioning than prior baselines. The approach is validated across diverse datasets, showing better denoising and generation metrics, effective cross-dataset transfer, and favorable comparisons to test-time conditioning methods. A variational extension (VFSDM) is explored but found harder to train, which highlights the practical strength of deterministic set conditioning. Overall, FSDM is a scalable, expressive method for rapidly adapting diffusion models to novel concepts from limited examples, and its patch-based tokenization makes it applicable to other modalities.
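To make the FiLM option concrete, the following is a minimal NumPy sketch of feature-wise linear modulation, in which the context c produces a per-channel scale and shift that are applied to a feature map of the denoising network. The function name `film`, the projection matrices `W_gamma` and `W_beta`, and all shapes are illustrative assumptions; this is not the authors' implementation.

```python
import numpy as np

def film(h, c, W_gamma, W_beta):
    """Modulate feature map h with context c (illustrative sketch).

    h:       (channels, height, width) feature map from a denoiser layer
    c:       (context_dim,) context vector from the set encoder
    W_gamma: (channels, context_dim) hypothetical projection for the scale
    W_beta:  (channels, context_dim) hypothetical projection for the shift
    """
    gamma = W_gamma @ c  # per-channel scale, shape (channels,)
    beta = W_beta @ c    # per-channel shift, shape (channels,)
    # Broadcast the per-channel terms over the spatial dimensions.
    return gamma[:, None, None] * h + beta[:, None, None]

# Usage: 8 channels, a 4x4 feature map, and a 16-dimensional context.
rng = np.random.default_rng(0)
h = rng.normal(size=(8, 4, 4))
c = rng.normal(size=16)
out = film(h, c, rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
```

In a cross-attention variant, the feature map would instead attend over a sequence of context tokens rather than receive a single pooled vector.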
Abstract
Denoising diffusion probabilistic models (DDPM) are powerful hierarchical latent variable models with remarkable sample generation quality and training stability. These properties can be attributed to parameter sharing in the generative hierarchy, as well as a parameter-free diffusion-based inference procedure. In this paper, we present Few-Shot Diffusion Models (FSDM), a framework for few-shot generation leveraging conditional DDPMs. FSDMs are trained to adapt the generative process conditioned on a small set of images from a given class by aggregating image patch information using a set-based Vision Transformer (ViT). At test time, the model is able to generate samples from previously unseen classes conditioned on as few as 5 samples from that class. We empirically show that FSDM can perform few-shot generation and transfer to new datasets. We benchmark variants of our method on complex vision datasets for few-shot learning and compare to unconditional and conditional DDPM baselines. Additionally, we show how conditioning the model on patch-based input set information improves training convergence.
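As a rough illustration of patch-based set conditioning, the sketch below splits each image in a support set into patches, embeds the patches linearly, and mean-pools all tokens into a single context vector. Mean pooling is used here only as a simple permutation-invariant stand-in for the set-based ViT described above; the names `patchify`, `set_context`, and `W_embed` are hypothetical.

```python
import numpy as np

def patchify(images, patch):
    """Split images of shape (n, H, W, C) into flattened patch tokens.

    Returns an array of shape (n * num_patches, patch * patch * C).
    """
    n, H, W, C = images.shape
    assert H % patch == 0 and W % patch == 0, "patch must divide H and W"
    x = images.reshape(n, H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # group the pixels of each patch
    return x.reshape(-1, patch * patch * C)

def set_context(images, patch, W_embed):
    """Embed every patch of every image, then pool over the whole set."""
    tokens = patchify(images, patch) @ W_embed  # (num_tokens, d)
    return tokens.mean(axis=0)  # order-independent stand-in for the ViT

# Usage: a support set of 5 RGB images of size 8x8, 4x4 patches, d = 16.
rng = np.random.default_rng(0)
support = rng.normal(size=(5, 8, 8, 3))
W_embed = rng.normal(size=(4 * 4 * 3, 16))
context = set_context(support, patch=4, W_embed=W_embed)
```

Because the pooled context does not depend on the order of the support images, the same function applies to sets of any size, including sets from classes not seen during training.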
