Few-Shot Image Generation by Conditional Relaxing Diffusion Inversion

Yu Cao, Shaogang Gong

TL;DR

This work tackles Few-Shot Image Generation (FSIG) by introducing Conditional Relaxing Diffusion Inversion (CRDI), a training-free approach that enhances distribution diversity without fine-tuning. CRDI first learns a Sample-wise Guidance Embedding (SGE) for each target sample in order to reconstruct that instance, and then applies an annealing noise scheduler to diversify the generated outputs, with a rigidity parameter $\eta$ controlling the time-dependence of the guidance. The method offers a theoretical diffusion-model perspective and demonstrates strong empirical performance, outperforming GAN-based reconstruction and matching or exceeding state-of-the-art FSIG methods across multiple target domains while mitigating overfitting and forgetting. The approach is compatible with existing diffusion models, is scalable, and strikes a practical balance between reconstruction quality and distribution coverage, with potential extensions such as CLIP-based guidance for semantic emphasis.

Abstract

In the field of Few-Shot Image Generation (FSIG) using Deep Generative Models (DGMs), accurately estimating the distribution of the target domain from minimal samples poses a significant challenge. This requires a method that can capture both the broad diversity and the true characteristics of the target domain distribution. We present Conditional Relaxing Diffusion Inversion (CRDI), an innovative 'training-free' approach designed to enhance distribution diversity in synthetic image generation. Distinct from conventional methods, CRDI does not rely on fine-tuning based on only a few samples. Instead, it focuses on reconstructing each target image instance and expanding diversity through few-shot learning. The approach begins by identifying a Sample-wise Guidance Embedding (SGE) for the diffusion model, which serves a purpose analogous to the explicit latent codes in certain Generative Adversarial Network (GAN) models. Subsequently, a scheduler progressively introduces perturbations to the SGE, thereby augmenting diversity. Comprehensive experiments demonstrate that our method surpasses GAN-based reconstruction techniques and matches state-of-the-art (SOTA) FSIG methods in performance. Additionally, it effectively mitigates overfitting and catastrophic forgetting, common drawbacks of fine-tuning approaches.
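
To make the idea of progressively perturbing a guidance embedding concrete, here is a minimal sketch. It is not the authors' scheduler: the linear growth of the noise scale, the Gaussian noise, and the names `perturbation_scales`, `perturb_embedding`, and `sigma_max` are all assumptions chosen for illustration, and the embedding is a stand-in vector rather than a learned SGE.

```python
import numpy as np

def perturbation_scales(num_steps: int, sigma_max: float) -> np.ndarray:
    """Hypothetical schedule: the noise scale grows linearly from 0 to sigma_max."""
    return np.linspace(0.0, sigma_max, num_steps)

def perturb_embedding(embedding: np.ndarray, scales: np.ndarray, seed: int = 0) -> np.ndarray:
    """Return one perturbed copy of `embedding` per scale (assumed Gaussian noise)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((len(scales),) + embedding.shape)
    # Reshape scales so that each one broadcasts over its own noise draw.
    scales_b = scales.reshape(-1, *([1] * embedding.ndim))
    return embedding[None, ...] + scales_b * noise

# Stand-in for a learned guidance embedding (assumption: a flat vector).
sge = np.ones(8)
scales = perturbation_scales(num_steps=5, sigma_max=0.5)
variants = perturb_embedding(sge, scales)
```

The first variant equals the original embedding because its noise scale is zero, and later variants drift further away; under these assumptions, that is the trade-off between faithful reconstruction and added diversity that the abstract describes.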

Paper Structure (11 sections, 8 equations, 6 figures, 4 tables, 1 algorithm)

This paper contains 11 sections, 8 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparison of reconstruction results using Image2StyleGAN (GAN method) [abdal2019image2stylegan] and our proposed method (CRDI) with varying $\eta$ values. Both source models are pre-trained on FFHQ [karras2019style]. Our CRDI uses the fast sampling method DDIM [song2020denoising] with a total of 25 inference steps. We set $\eta = 1, 8, 15$, where a larger value means a stricter diffusion time-dependent SGE.
  • Figure 2: A visualization of three randomly sampled trajectories (blue, orange and green), all originating from the same initial point (red) and generated using Langevin dynamics. The green dots represent the distribution of the intermediate state $x_t$. A time-independent SGE is learned from one direct trajectory (yellow), which can be regarded as a directional path from $x_\alpha$ to $x_\beta$. The SGEs used to guide generation are perturbed by noise (grey) as defined in Sec. \ref{sec:diversity}. Note that the right corner does not represent $x_0$.
  • Figure 3: Generated facial images for the Babies domain with different values of $\eta$. A slight source-domain leakage problem (orange box) appears when $\eta=1$.
  • Figure 4: Left: t-SNE results of the given samples from the Target Domain (Babies) (red), the Source Domain (FFHQ) (blue), our generated samples (green), and samples generated by RICK [mondal2022few] (purple). Our generated samples are more closely aligned with the given target domain samples than those of RICK. Right: A simulation depicting two SDE transitions from $P_I$ to $P_S$ and $P_T$. The two solid red lines illustrate the mean trajectories towards the Source and Target Domains, while the red dashed line indicates their extension.
  • Figure 5: We present the generated samples and Intra-LPIPS ($\boldsymbol{\uparrow}$) for our method alongside four other high-performance methods across Babies ($\mathcal{T}_1$) and MetFaces ($\mathcal{T}_2$), which have different degrees of similarity to the source domain ($\mathcal{S}$). While our method is not consistently the best in Intra-LPIPS ($\boldsymbol{\uparrow}$), the quality and mode coverage (red box) of our samples are superior, characterized by fewer artifacts and an absence of noticeable overfitting. The best result is in bold and the second best is in bold with underline. For more examples, please refer to the Supplementary Material.
  • ...and 1 more figure
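
The Figure 1 caption mentions DDIM sampling with 25 inference steps. As a rough illustration of what that involves, the sketch below runs a deterministic DDIM loop over an evenly spaced subsequence of 25 timesteps. Several things here are assumptions not taken from the source: a 1000-step training schedule, a linear beta schedule, and `toy_eps_model`, a placeholder for a real noise-prediction network. The sketch also does not model the rigidity parameter $\eta$ from the captions, which is described there as controlling the time-dependence of the SGE; the sampler's own stochasticity is simply fixed to zero.

```python
import numpy as np

def ddim_timesteps(num_train_steps: int = 1000, num_inference_steps: int = 25) -> np.ndarray:
    """Evenly spaced timestep subsequence, ordered from high noise to low noise."""
    stride = num_train_steps // num_inference_steps
    return np.arange(0, num_train_steps, stride)[::-1]

def ddim_step(x_t, eps, abar_t, abar_prev):
    """One deterministic DDIM update (sampler stochasticity set to zero)."""
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
    # Re-noise that estimate to the previous (less noisy) timestep.
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps

# Assumed linear beta schedule; a real pre-trained model defines its own.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)

def toy_eps_model(x, t):
    """Placeholder noise predictor (hypothetical); always predicts zero noise."""
    return np.zeros_like(x)

timesteps = ddim_timesteps()
x_init = np.random.default_rng(0).standard_normal(4)
x = x_init.copy()
for i, t in enumerate(timesteps):
    abar_t = alpha_bars[t]
    # After the last step the sample is treated as noise-free (alpha_bar = 1).
    abar_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
    x = ddim_step(x, toy_eps_model(x, t), abar_t, abar_prev)
```

With a real network in place of `toy_eps_model`, a conditioning signal such as a guidance embedding would typically enter through the noise prediction; how CRDI does this is not specified in the text above.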