ACCORD: Alleviating Concept Coupling through Dependence Regularization for Text-to-Image Diffusion Personalization

Shizhan Liu, Hao Zheng, Hang Yu, Jianguo Li

TL;DR

The paper tackles concept coupling in text-to-image personalization by formulating it as unintended dependence between a personalization target and other concepts. It introduces ACCORD, which splits the problem into two computable dependence discrepancies and provides two plug-and-play losses, Denoising Decouple Loss and Prior Decouple Loss, to minimize them without extra data. The approach is evaluated on subject, style, and zero-shot face personalization, where it improves both text control and personalization fidelity over strong baselines, with ablations supporting these results. Its plug-and-play nature and theoretical grounding offer a practical path to more faithful personalized generation with diffusion models.

Abstract

Image personalization has garnered attention for its ability to customize Text-to-Image generation using only a few reference images. However, a key challenge in image personalization is concept coupling, where the limited number of reference images leads the model to form unwanted associations between the personalization target and other concepts. Current methods attempt to tackle this issue indirectly, leading to a suboptimal balance between text control and personalization fidelity. In this paper, we take a direct approach to the concept coupling problem through statistical analysis, revealing that it stems from two distinct sources of dependence discrepancies. We therefore propose two complementary plug-and-play loss functions, Denoising Decouple Loss and Prior Decouple Loss, each designed to minimize one type of dependence discrepancy. Extensive experiments demonstrate that our approach achieves a superior trade-off between text control and personalization fidelity.

Paper Structure

This paper contains 21 sections, 3 theorems, 21 equations, 8 figures, 6 tables.

Key Result

Lemma 1

$\mathbb{E}_{{\mathbf{x}}_\theta}[ | \log r({\mathbf{c}}_p, {\mathbf{c}}_g | {\mathbf{x}}_{\theta, 0}) - \log r({\mathbf{c}}_s, {\mathbf{c}}_g) | ] >0$ holds when either (i) $r({\mathbf{c}}_p, {\mathbf{c}}_g | {\mathbf{x}}_{\theta, 0}) > r({\mathbf{c}}_s, {\mathbf{c}}_g)$ (overly positive dependence

Figures (8)

  • Figure 1: Illustration of the concept coupling problem. The personalization target is a "backpack", but in the reference images, the backpack and a girl always appear together. This causes the model finetuned without concept decoupling to frequently generate an additional girl and not fully adhere to the prompt. Statistically, the co-occurrence of "backpack" and "girl" in generated images is significantly higher than the inherent concept dependence.
  • Figure 2: Calculation of the Denoising Decouple Loss $\mathcal{L}_\text{DD}$. The UNet estimates ${\mathbf{x}}_{t-1}$ based on ${\mathbf{x}}_t$ and four different conditions, then constrains the four denoising results. The objective of $\mathcal{L}_\text{DD}$ is to prevent the conditional dependence coefficient between the personalization target ${\mathbf{c}}_p$ and general text conditions ${\mathbf{c}}_g$ from varying significantly in the denoising results of adjacent timesteps.
  • Figure 3: The calculation of the Prior Decouple Loss $\mathcal{L}_\text{PD}$. The purpose of $\mathcal{L}_\text{PD}$ is to prevent excessive prior dependence between ${\mathbf{c}}_p$ and general text conditions ${\mathbf{c}}_g$. During computation, we first use the CLIP projector to map ${\mathbf{c}}_p$ and ${\mathbf{c}}_g$ into ${\mathbf{f}}_s$ and ${\mathbf{f}}_g$, respectively, and then minimize their absolute cosine similarity.
  • Figure 4: A comparison of the visual outcomes of subject personalization, style personalization, and face personalization, where "superclass*" denotes the personalization target.
  • Figure 5: Visualization of the impact of DDLoss and PDLoss.
  • ...and 3 more figures
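
The Figure 3 caption describes the Prior Decouple Loss as minimizing the absolute cosine similarity between two projected features, ${\mathbf{f}}_s$ and ${\mathbf{f}}_g$. The following is a minimal sketch of that final step only, not the authors' implementation. It assumes the CLIP projection has already been applied and that the features arrive as plain lists of floats. The function names `cosine_similarity` and `prior_decouple_loss` are hypothetical and chosen for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def prior_decouple_loss(f_s, f_g):
    """Absolute cosine similarity between the two projected features.

    Assumption: f_s and f_g are already the outputs of the CLIP projector
    described in the Figure 3 caption; the projection itself is not modeled here.
    """
    return abs(cosine_similarity(f_s, f_g))
```

Under this sketch, orthogonal features give a loss of zero, while parallel and anti-parallel features both give a loss of one. Taking the absolute value penalizes similarity in either direction. In actual training the same quantity would be computed on differentiable tensors so that gradients can flow back to the personalization condition.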

Theorems & Definitions (5)

  • Lemma 1
  • Theorem 1
  • proof
  • Theorem 2
  • proof