ConceptSplit: Decoupled Multi-Concept Personalization of Diffusion Models via Token-wise Adaptation and Attention Disentanglement

Habin Lim, Yeongseob Won, Juwon Seo, Gyeong-Moon Park

TL;DR

ConceptSplit addresses the problem of concept mixing in multi-concept diffusion model personalization by decoupling concept adaptation and attention control. It introduces Token-wise Value Adaptation (ToVA), which updates only the value projection for targeted tokens to avoid merging adapters, and Latent Optimization for Disentangled Attention (LODA), a two-stage latent-space approach that separates and then fixes attention to reduce entanglement. The method yields merging-free personalization with preserved token-attention binding and state-of-the-art disentanglement across benchmarks, validated by both quantitative metrics and qualitative analysis. It improves compositional fidelity and reduces interference, offering practical benefits for robust multi-concept synthesis in diffusion-based image generation, with code available online.

Abstract

In recent years, multi-concept personalization of text-to-image (T2I) diffusion models, which aims to represent several subjects in a single image, has gained increasing attention. The main challenge of this task is "concept mixing", where multiple learned concepts interfere or blend undesirably in the output image. To address this issue, we present ConceptSplit, a novel framework that separates individual concepts during both training and inference. Our framework comprises two key components. First, we introduce Token-wise Value Adaptation (ToVA), a merging-free training method that focuses exclusively on adapting the value projection in cross-attention. Our empirical analysis shows that modifying the key projection, a common approach in existing methods, can disrupt the attention mechanism and lead to concept mixing. Second, we propose Latent Optimization for Disentangled Attention (LODA), which alleviates attention entanglement during inference by optimizing the input latent. Through extensive qualitative and quantitative experiments, we demonstrate that ConceptSplit achieves robust multi-concept personalization, mitigating unintended concept interference. Code is available at https://github.com/KU-VGI/ConceptSplit
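The idea behind ToVA, adapting only the value projection for selected tokens while leaving keys untouched, can be illustrated with a toy cross-attention layer. This is a minimal sketch and not the authors' implementation: the low-rank `(down, up)` adapter form, the function `cross_attention`, and all shapes are assumptions made for illustration. It shows that when only the values of the target tokens change, the attention map is identical to the base model's.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, text_emb, w_k, w_v, adapters=None):
    """Toy cross-attention with optional per-token value adapters.

    adapters maps a token index to a hypothetical low-rank pair
    (down, up); this adapter form is an assumption of the sketch.
    """
    keys = text_emb @ w_k      # keys are never adapted
    values = text_emb @ w_v
    if adapters:
        for idx, (down, up) in adapters.items():
            # Only the value of the targeted token is modified.
            values[idx] += text_emb[idx] @ down @ up
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ values, attn, values

rng = np.random.default_rng(0)
n_pix, n_tok, d_txt, d_attn, rank = 6, 5, 8, 4, 2
queries = rng.normal(size=(n_pix, d_attn))
text_emb = rng.normal(size=(n_tok, d_txt))
w_k = rng.normal(size=(d_txt, d_attn))
w_v = rng.normal(size=(d_txt, d_attn))

# Two independently stored adapters, attached to tokens 1 and 3
# at inference time without merging any weights.
adapters = {
    1: (rng.normal(size=(d_txt, rank)), rng.normal(size=(rank, d_attn))),
    3: (rng.normal(size=(d_txt, rank)), rng.normal(size=(rank, d_attn))),
}

base_out, base_attn, base_vals = cross_attention(queries, text_emb, w_k, w_v)
tova_out, tova_attn, tova_vals = cross_attention(
    queries, text_emb, w_k, w_v, adapters
)
```

Because each adapter touches only its own token's value vector, several adapters can be attached side by side; in this toy setting they cannot alter where the model attends, only what is read from the attended token.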

Paper Structure

This paper contains 25 sections, 13 equations, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: The goals of our framework ConceptSplit are twofold. (1) Preventing concept interference in the adapter approach (denoted as $\mathcal{A}$), while preserving the token-attention binding capacity of T2I models. (2) Separating entangled attention, which would otherwise result in a mixed representation of learned concepts.
  • Figure 1: Qualitative comparison in single-object scenarios on Stable Diffusion 2.1. In single-object scenarios, our approach ensures that the background is appropriately generated alongside the target concept, maintaining contextual integrity.
  • Figure 2: Overview of our framework ConceptSplit. (a) Training Phase: Adapters are applied exclusively to target tokens, modifying only their values while preserving the model’s baseline attention capacity. These adapters are then stored in a database (DB). (b) Inference Phase: Pre-trained adapters are dynamically attached to desired tokens, enabling selective value modulation without weight merging. This ensures minimal interference with unrelated concepts. (c) LODA: We optimize the latent for $N$ inference steps to disentangle attention (stage 1), then fix the disentangled attention to enable natural generation of distinct objects without continuous optimization (stage 2).
  • Figure 2: Effect of hyperparameters $p$ and $m$, which respectively strengthen and weaken the attention scores of each token.
  • Figure 3:
  • ...and 7 more figures
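
The two-stage LODA procedure described in the Figure 2 caption (optimize the latent for a few steps, then fix the attention) can be sketched on a toy problem. Everything here is an assumption for illustration: the overlap loss, the finite-difference gradient, the backtracking step size, and the names `attention_map`, `overlap_loss`, and `optimize_latent` are hypothetical and are not taken from the paper or its code.

```python
import numpy as np

def attention_map(latent, w_q, keys):
    """Per-pixel softmax attention over text tokens."""
    scores = (latent @ w_q) @ keys.T / np.sqrt(keys.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def overlap_loss(latent, w_q, keys, tok_a, tok_b):
    # Assumed entanglement measure: pixels attending to both tokens.
    attn = attention_map(latent, w_q, keys)
    return float(np.sum(attn[:, tok_a] * attn[:, tok_b]))

def numerical_grad(f, x, eps=1e-5):
    """Central finite differences; fine for a toy-sized latent."""
    grad = np.zeros_like(x)
    for i in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

def optimize_latent(latent, w_q, keys, tok_a, tok_b, n_steps=20, lr=1.0):
    """Stage 1: gradient descent on the latent with backtracking."""
    loss_fn = lambda z: overlap_loss(z, w_q, keys, tok_a, tok_b)
    history = [loss_fn(latent)]
    for _ in range(n_steps):
        grad = numerical_grad(loss_fn, latent)
        step = lr
        candidate = latent - step * grad
        # Halve the step until the loss actually decreases.
        while loss_fn(candidate) >= history[-1] and step > 1e-6:
            step *= 0.5
            candidate = latent - step * grad
        if loss_fn(candidate) < history[-1]:
            latent = candidate
        history.append(loss_fn(latent))
    return latent, history

rng = np.random.default_rng(0)
latent = rng.normal(size=(6, 4))            # 6 "pixels", 4 channels
w_q = rng.normal(scale=0.5, size=(4, 4))    # toy query projection
keys = rng.normal(size=(5, 4))              # 5 text tokens

latent_opt, history = optimize_latent(latent, w_q, keys, tok_a=1, tok_b=3)

# Stage 2 (assumed form): cache the disentangled map and reuse it in
# later steps instead of continuing to optimize the latent.
fixed_attn = attention_map(latent_opt, w_q, keys)
```

In a real diffusion pipeline the gradient would come from automatic differentiation through the denoising network rather than finite differences; the sketch only illustrates the split between a short optimization phase and a phase that reuses the resulting attention.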