ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance

Jiannan Huang, Jun Hao Liew, Hanshu Yan, Yuyang Yin, Yao Zhao, Humphrey Shi, Yunchao Wei

TL;DR

ClassDiffusion addresses semantic drift and the associated loss of compositionality in personalized diffusion models. It introduces Semantic Preservation Loss (SPL) to constrain the target concept embeddings so they remain close to their superclass in the model's semantic space, yielding the objective $\mathcal{L} = \mathcal{L}_{recon} + \lambda \mathcal{L}_{sp}$. The approach improves cross-contrast alignment and joint conditional sampling, demonstrated through image and video personalization with quantitative gains on text- and image-based metrics and qualitative demonstrations. A new evaluation metric, BLIP2-T, is proposed to better capture text-image alignment in this domain. Overall, ClassDiffusion offers a simple yet effective way to preserve semantic structure during personalization, enabling more reliable compositional generation and extending to personalized video synthesis.

Abstract

Recent text-to-image customization works have proven successful in generating images of given concepts by fine-tuning diffusion models on a few examples. However, tuning-based methods inherently tend to overfit the concepts, resulting in failure to create the concept under multiple conditions (*e.g.*, the headphone is missing when generating "a dog wearing a headphone"). Interestingly, we notice that the base model before fine-tuning exhibits the capability to compose the base concept with other elements (*e.g.*, "a dog wearing a headphone"), implying that the compositional ability only disappears after personalization tuning. We observe a semantic shift in the customized concept after fine-tuning, indicating that the personalized concept is not aligned with the original concept, and further show through theoretical analyses that this semantic shift leads to increased difficulty in sampling the joint conditional probability distribution, resulting in the loss of the compositional ability. Inspired by this finding, we present **ClassDiffusion**, a technique that leverages a **semantic preservation loss** to explicitly regulate the concept space when learning a new concept. Although simple, this approach effectively prevents semantic drift during the fine-tuning process of the target concepts. Extensive qualitative and quantitative experiments demonstrate that the use of semantic preservation loss effectively improves the compositional abilities of fine-tuned models. Lastly, we also extend our ClassDiffusion to personalized video generation, demonstrating its flexibility.
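
To make the combined objective $\mathcal{L} = \mathcal{L}_{recon} + \lambda \mathcal{L}_{sp}$ concrete, here is a minimal sketch in plain Python. It assumes the semantic preservation term is a cosine distance between the text embedding of the personalized concept and that of its superclass, and that the reconstruction term is a mean squared error on predicted noise. Neither form is spelled out in this summary, and the function names (`cosine_distance`, `mse`, `total_loss`) and the toy vectors are purely illustrative.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cos(a, b); 0 means the two embeddings point the same way."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error, standing in for the reconstruction loss (assumption)."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(
    noise_pred: Sequence[float],
    noise_target: Sequence[float],
    concept_emb: Sequence[float],
    class_emb: Sequence[float],
    lam: float,
) -> float:
    """L = L_recon + lam * L_sp, with L_sp assumed to be a cosine distance."""
    l_recon = mse(noise_pred, noise_target)
    l_sp = cosine_distance(concept_emb, class_emb)
    return l_recon + lam * l_sp


if __name__ == "__main__":
    # Toy 2-D "embeddings": the concept has drifted away from its class direction.
    drifted = total_loss([0.1, 0.2], [0.1, 0.2], [1.0, 0.0], [0.0, 1.0], lam=0.5)
    aligned = total_loss([0.1, 0.2], [0.1, 0.2], [0.0, 2.0], [0.0, 1.0], lam=0.5)
    print(f"drifted concept: {drifted:.3f}, aligned concept: {aligned:.3f}")
```

Under these assumptions, the extra term penalizes a concept embedding only for pointing away from its superclass, so it leaves the reconstruction term free to fit the target images while discouraging semantic drift. The weight `lam` is left as a required argument because this summary does not state a value.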

Paper Structure (29 sections, 12 equations, 16 figures, 3 tables, 1 algorithm)

This paper contains 29 sections, 12 equations, 16 figures, 3 tables, 1 algorithm.

Figures (16)

  • Figure 1: The base Stable Diffusion (SD) possesses the capability to compose the concepts of a dog and a headphone, generating a dog wearing a headphone. However, we notice that this compositional generation capability is lost during personalization tuning. For example, when using Custom Diffusion (CD) (Kumari et al., 2023), the headphone is missing even though the target corgi is generated successfully. In contrast, our method successfully composes the target corgi with the headphone.
  • Figure 2: Qualitative results: two short stories produced by our model. The top story follows a bear's literary journey, from reading a book to ultimately winning the Nobel Prize in Literature. The bottom story follows the fate of a pair of sunglasses, which the bear ends up with. These stories illustrate a potential real-world application of our model.
  • Figure 3: Comparison of distances in CLIP text space.
  • Figure 4: Visualization of the cross-attention map activation area.
  • Figure 6: The orange and green point sets represent the distributions of dogs and headphones, respectively, and their overlapping region represents their joint probability distribution. During the tuning process, the conditional distribution of dogs and headphones shrinks, which gradually increases the difficulty of sampling. Unlike the Prior Preservation Loss (PPL) in DreamBooth (Ruiz et al., 2023), which aims to maintain class diversity, our proposed Semantic Preservation Loss (SPL) focuses on recovering the semantic space of the customized concept. This approach enables our method to synthesize images that are more consistent with the text prompt.
  • ...and 11 more figures