ConceptPrism: Concept Disentanglement in Personalized Diffusion Models via Residual Token Optimization

Minseo Kim, Minchan Kwon, Dongyeun Lee, Yunho Jeon, Junmo Kim

TL;DR

ConceptPrism is a novel framework that automatically disentangles the shared visual concept from image-specific residuals by comparing images within a set. It jointly optimizes a target token and image-wise residual tokens using two complementary objectives: a reconstruction loss to ensure fidelity, and a novel exclusion loss that compels the residual tokens to discard the shared concept.

Abstract

Personalized text-to-image generation suffers from concept entanglement, where irrelevant residual information from reference images is captured, leading to a trade-off between concept fidelity and text alignment. Recent disentanglement approaches attempt to solve this utilizing manual guidance, such as linguistic cues or segmentation masks, which limits their applicability and fails to fully articulate the target concept. In this paper, we propose ConceptPrism, a novel framework that automatically disentangles the shared visual concept from image-specific residuals by comparing images within a set. Our method jointly optimizes a target token and image-wise residual tokens using two complementary objectives: a reconstruction loss to ensure fidelity, and a novel exclusion loss that compels residual tokens to discard the shared concept. This process allows the target token to capture the pure concept without direct supervision. Extensive experiments demonstrate that ConceptPrism effectively resolves concept entanglement, achieving a significantly improved trade-off between fidelity and alignment.

Paper Structure

This paper contains 41 sections, 9 equations, 11 figures, 5 tables, 2 algorithms.

Figures (11)

  • Figure 1: Motivation of ConceptPrism. The reconstruction loss ($\mathcal{L}_{\text{rec}}$) promotes information acquisition from the given image, while the exclusion loss ($\mathcal{L}_{\text{excl}}$) compels discarding the commonalities from the set. By jointly optimizing the target and residual tokens with dual losses, we disentangle the personalized visual concept from irrelevant details without explicit guidance.
  • Figure 2: Training Pipeline of ConceptPrism. Our method comprises two stages: (a) In the Token Optimization stage, the target and image-wise residual tokens are jointly optimized via dual losses. The reconstruction loss ($\mathcal{L}_{\text{rec}}$) guides the faithful reconstruction of the given image by conditioning on both tokens simultaneously. The exclusion loss ($\mathcal{L}_{\text{excl}}$) forces the residual token to be uninformative of the shared target concept $\mathcal{C}_{\text{target}}$ by matching the unconditional generation probability distribution. (b) In the Subsequent Fine-Tuning stage, the learned tokens initialize the model to focus only on the necessary concept, effectively resolving the trade-off caused by concept entanglement.
  • Figure 3: Performance trade-off by training steps. Our method was evaluated at 40-step intervals, while the baselines were evaluated at 200-step intervals. The asterisk ($\star$) denotes the best visual quality step for each method. Our curve demonstrates a superior trade-off compared to all baselines, highlighting its effective concept disentanglement.
  • Figure 4: Qualitative Results on various subjects. Our method shows balanced performance, maintaining the visual details of the personalized subject while adhering to the text prompt. Other baselines tend to either copy input images (DreamBooth) or ignore concept representation to follow the prompt (Custom Diffusion, DisenBooth).
  • Figure 5: Qualitative Results on abstract styles. Our method learns the abstract style beyond a single-word class noun. DreamBooth generates similar scenes due to overfitting to the reference images. Other baselines fail to capture the given visual details, adhering only to the linguistic cues (e.g., "sunset," "swirl").
  • ...and 6 more figures
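
The dual-loss structure described in the Figure 1 and Figure 2 captions can be illustrated with a small toy sketch. This is not the authors' implementation, and every name in it (toy_noise_predictor, reconstruction_loss, exclusion_loss, total_loss, lam) is hypothetical. It assumes a noise-prediction (epsilon) objective, stands in a trivial function for the diffusion model, combines the two tokens by simple addition, and uses a weighted sum of the two losses. None of these choices is confirmed by the captions, which also do not say which noisy inputs the exclusion loss is evaluated on, so that input is left as an argument.

```python
def toy_noise_predictor(x_t, cond=None):
    """Hypothetical stand-in for a diffusion model's noise predictor.

    cond=None plays the role of unconditional generation.
    """
    if cond is None:
        return [0.5 * x for x in x_t]
    return [0.5 * x + c for x, c in zip(x_t, cond)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def reconstruction_loss(x_t, noise, target_tok, residual_tok):
    """Condition on both tokens at once and compare with the true noise.

    Adding the two token vectors is an assumption made for brevity; a real
    model would place both tokens in the text prompt.
    """
    cond = [t + r for t, r in zip(target_tok, residual_tok)]
    return mse(toy_noise_predictor(x_t, cond), noise)


def exclusion_loss(x_t, residual_tok):
    """Pull the residual-conditioned prediction toward the unconditional one."""
    conditional = toy_noise_predictor(x_t, residual_tok)
    unconditional = toy_noise_predictor(x_t, None)
    return mse(conditional, unconditional)


def total_loss(x_t, noise, target_tok, residual_tok, lam=1.0):
    """Weighted sum of the two objectives; the weight lam is an assumption."""
    rec = reconstruction_loss(x_t, noise, target_tok, residual_tok)
    excl = exclusion_loss(x_t, residual_tok)
    return rec + lam * excl


if __name__ == "__main__":
    x_t, noise = [2.0, 4.0], [1.0, 2.0]
    print(total_loss(x_t, noise, target_tok=[1.0, 0.0], residual_tok=[0.0, 0.0]))
```

In this toy setting the exclusion term vanishes only when the residual token has no effect on the prediction for the supplied input. That is the sense in which the caption describes the residual token as "uninformative" of the shared concept.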