Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions

Stefan Andreas Baumann, Felix Krause, Michael Neumayr, Nick Stracke, Melvin Sevi, Vincent Tao Hu, Björn Ommer

TL;DR

The paper addresses continuous, subject-specific control over high-level attributes in text-to-image diffusion models without modifying the diffusion model itself. It shows that tokenwise CLIP text embeddings contain semantically meaningful local directions; adding these directions to the embeddings of subject tokens gives fine-grained, additive control over the attributes of individual subjects, including several subjects within a single image. Two methods identify such directions: an optimization-free approach based on differences between contrastive prompts, and a learning-based approach that distills directions from the diffusion model's noise predictions and yields more generalizable directions. The approach supports real-image editing via inversion and is compatible with existing editing methods. It demonstrates strong subject-specificity and disentanglement, and it generalizes across nouns, models, and even non-diffusion architectures. The paper also acknowledges limitations and discusses avenues for future work.

Abstract

Recent advances in text-to-image (T2I) diffusion models have significantly improved the quality of generated images. However, providing efficient control over individual subjects, particularly the attributes characterizing them, remains a key challenge. While existing methods have introduced mechanisms to modulate attribute expression, they typically provide either detailed, object-specific localization of such a modification or full-scale fine-grained, nuanced control of attributes. No current approach offers both simultaneously, resulting in a gap when trying to achieve precise continuous and subject-specific attribute modulation in image generation. In this work, we demonstrate that token-level directions exist within commonly used CLIP text embeddings that enable fine-grained, subject-specific control of high-level attributes in T2I models. We introduce two methods to identify these directions: a simple, optimization-free technique and a learning-based approach that utilizes the T2I model to characterize semantic concepts more specifically. Our methods allow the augmentation of the prompt text input, enabling fine-grained control over multiple attributes of individual subjects simultaneously, without requiring any modifications to the diffusion model itself. This approach offers a unified solution that fills the gap between global and localized control, providing competitive flexibility and precision in text-guided image generation. Project page: https://compvis.github.io/attribute-control. Code is available at https://github.com/CompVis/attribute-control.
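
The optimization-free idea can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes that tokenwise prompt embeddings are already available as arrays of shape (tokens, dim), that the subject token sits at the same index in every prompt, and that a direction is the mean difference between contrastive prompt pairs. The function names `difference_direction` and `apply_direction`, the toy sizes, and the random stand-in embeddings are all hypothetical.

```python
import numpy as np


def difference_direction(pos_embs, neg_embs, subject_idx):
    """Estimate an attribute direction from contrastive prompt pairs.

    pos_embs, neg_embs: arrays of shape (num_pairs, num_tokens, dim) holding
    tokenwise embeddings of prompts with high / low attribute expression.
    Assumption: the subject token is at `subject_idx` in every prompt.
    """
    pos = np.asarray(pos_embs, dtype=np.float64)
    neg = np.asarray(neg_embs, dtype=np.float64)
    # Difference at the subject token only, averaged over all prompt pairs.
    return (pos[:, subject_idx, :] - neg[:, subject_idx, :]).mean(axis=0)


def apply_direction(prompt_emb, direction, subject_idx, scale):
    """Add `scale * direction` to one subject token; other tokens are untouched."""
    out = np.array(prompt_emb, dtype=np.float64, copy=True)
    out[subject_idx] += scale * direction
    return out


# Toy usage with random stand-ins for real text-encoder outputs.
rng = np.random.default_rng(0)
pos = rng.normal(size=(4, 8, 16))   # 4 prompt pairs, 8 tokens, 16 dims
neg = rng.normal(size=(4, 8, 16))
delta = difference_direction(pos, neg, subject_idx=2)

prompt = rng.normal(size=(8, 16))
# A continuous scale gives continuous control; scale=0 leaves the prompt as is.
edited = [apply_direction(prompt, delta, subject_idx=2, scale=s)
          for s in (-1.0, 0.0, 0.5, 1.0)]
```

Because the edit touches only the row of the chosen subject token, directions for different attributes or different subjects can be added independently in this sketch, which mirrors the additive, subject-specific control described in the abstract.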

Paper Structure

This paper contains 75 sections, 8 equations, 28 figures, and 2 tables.

Figures (28)

  • Figure 1: Quantitative comparison with other control methods. We evaluate (a) subject-specificity of control in multi-subject settings, (b) disentanglement of attribute control vs. overall image changes, where we normalize the change metrics $\Delta \text{Id}$ and $\mathrm{LPIPS}$ by the attribute expression change $|\Delta \mathrm{CLIP}_\mathrm{Bi}|$, (c) whether the method supports fully continuous, uninterrupted control starting from the original image, and (d) image generation speed (on an Nvidia A100 at batch size 1).
  • Figure 2: The tokenwise CLIP text embedding space is not globally smooth. We linearly interpolate between the embeddings of two prompts while keeping the noise seed fixed. Near the original embeddings, changes are smooth and semantically interpretable, but strong phase transitions exist between substantially different subjects (e.g., "car" vs. "frog").
  • Figure 3: The tokenwise CLIP embedding space enables subject-specific interventions. Changes to the embedding of subject tokens can lead to disentangled local changes focused on that subject.
  • Figure 4: Variations along "vehicle price" directions identified using our methods. (a) Modulate along direction from difference-based approach (\ref{sec:method_naive_deltas}). (b) Modulate along direction from robust learned approach (\ref{sec:method_robust_deltas}). Unmodified images are marked in green. These directions successfully capture the target attribute and allow for fine-grained modulation but (a) also shows unwanted side-effects such as flipping the car's orientation.
  • Figure 5: Illustration of our method's intuition. We find that directions corresponding to modulating an attribute $A_i$ in the noise prediction space $\Delta \tilde{\boldsymbol{\epsilon}}$ (green) from a specific starting point $\mathbf{x}_t$ can be backpropagated (purple) through the diffusion model (\ref{eq:loss_direction}) to obtain a generalized matching direction $\Delta\mathbf{e}_{A_i}$ (blue) in the tokenwise embedding space. $\mathcal{E}(P)$ is the prompt embedding, $\hat{\boldsymbol{\epsilon}}_{\theta}(\cdot)$ the diffusion model.
  • ...and 23 more figures
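
The linear interpolation described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the two prompt embeddings are taken to be arrays of identical shape (tokens, dim), and the helper name `interpolate_embeddings` is hypothetical. In the paper's setup, each interpolated embedding would be passed to the T2I model with the same noise seed; that generation step is not shown here.

```python
import numpy as np


def interpolate_embeddings(emb_a, emb_b, num_steps):
    """Return `num_steps` embeddings on the straight line from emb_a to emb_b.

    Step k uses weight t_k in [0, 1]: (1 - t_k) * emb_a + t_k * emb_b.
    """
    a = np.asarray(emb_a, dtype=np.float64)
    b = np.asarray(emb_b, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("both embeddings must have the same shape")
    ts = np.linspace(0.0, 1.0, num_steps)  # includes both endpoints
    return np.stack([(1.0 - t) * a + t * b for t in ts])


# Toy usage with random stand-ins for two prompt embeddings.
rng = np.random.default_rng(0)
emb_car = rng.normal(size=(8, 16))
emb_frog = rng.normal(size=(8, 16))
path = interpolate_embeddings(emb_car, emb_frog, num_steps=5)  # shape (5, 8, 16)
```

The interpolation itself is linear in embedding space; the caption's point is that the generated images need not change linearly along this path, with abrupt transitions between substantially different subjects.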