Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions
Stefan Andreas Baumann, Felix Krause, Michael Neumayr, Nick Stracke, Melvin Sevi, Vincent Tao Hu, Björn Ommer
TL;DR
The paper addresses continuous, subject-specific control over high-level attributes in text-to-image diffusion models without modifying the diffusion model itself. It shows that tokenwise CLIP embeddings contain semantically meaningful local directions; adding these directions to the embeddings of subject tokens gives fine-grained, additive control over the attributes of individual subjects, including several subjects within a single image. Two methods are proposed to identify robust semantic directions: an optimization-free approach based on differences between contrastive prompts, and a learning-based diffusion-noise distillation method that yields more generalizable directions. The approach supports real-image editing via inversion and is compatible with existing editing methods. It demonstrates strong subject-specificity and disentanglement and generalizes across nouns, models, and even non-diffusion architectures. The paper also acknowledges limitations and discusses avenues for future work.
Abstract
Recent advances in text-to-image (T2I) diffusion models have significantly improved the quality of generated images. However, providing efficient control over individual subjects, particularly the attributes characterizing them, remains a key challenge. While existing methods have introduced mechanisms to modulate attribute expression, they typically provide either detailed, object-specific localization of such a modification or full-scale fine-grained, nuanced control of attributes. No current approach offers both simultaneously, resulting in a gap when trying to achieve precise continuous and subject-specific attribute modulation in image generation. In this work, we demonstrate that token-level directions exist within commonly used CLIP text embeddings that enable fine-grained, subject-specific control of high-level attributes in T2I models. We introduce two methods to identify these directions: a simple, optimization-free technique and a learning-based approach that utilizes the T2I model to characterize semantic concepts more specifically. Our methods allow the augmentation of the prompt text input, enabling fine-grained control over multiple attributes of individual subjects simultaneously, without requiring any modifications to the diffusion model itself. This approach offers a unified solution that fills the gap between global and localized control, providing competitive flexibility and precision in text-guided image generation. Project page: https://compvis.github.io/attribute-control. Code is available at https://github.com/CompVis/attribute-control.
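To make the optimization-free idea concrete, the following is a minimal sketch rather than the authors' implementation (the linked repository holds the actual code). It assumes that a token-level direction can be estimated as the mean difference between the subject token's embedding in contrastive prompt pairs, and that control is applied by adding scaled directions to that token only. The function names `contrastive_direction` and `apply_directions`, the array shapes, and the random arrays standing in for CLIP text-encoder outputs are all hypothetical.

```python
import numpy as np


def contrastive_direction(pos_embs, neg_embs):
    """Estimate an attribute direction for one subject token.

    pos_embs / neg_embs: arrays of shape (num_pairs, dim) holding the subject
    token's embedding in prompts with / without the attribute (assumed inputs).
    Returns the mean pairwise difference, shape (dim,).
    """
    pos = np.asarray(pos_embs, dtype=np.float64)
    neg = np.asarray(neg_embs, dtype=np.float64)
    return (pos - neg).mean(axis=0)


def apply_directions(prompt_emb, subject_idx, directions, scales):
    """Add scaled directions to the subject token(s) only.

    prompt_emb: array of shape (num_tokens, dim); it is copied, not modified.
    subject_idx: index (or list of indices) of the subject's token(s).
    directions, scales: one direction vector and one scalar per attribute.
    """
    out = np.array(prompt_emb, dtype=np.float64, copy=True)
    for direction, scale in zip(directions, scales):
        out[subject_idx] += scale * direction
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, num_tokens, num_pairs = 8, 5, 4

    # Random stand-ins for subject-token embeddings from contrastive prompts.
    neg = rng.normal(size=(num_pairs, dim))
    pos = neg + rng.normal(loc=1.0, scale=0.1, size=(num_pairs, dim))
    age_direction = contrastive_direction(pos, neg)

    # Random stand-in for the tokenwise embedding of a full prompt.
    prompt = rng.normal(size=(num_tokens, dim))
    edited = apply_directions(prompt, subject_idx=2,
                              directions=[age_direction], scales=[0.5])

    # Only the subject token (row 2) differs from the original prompt.
    changed_rows = np.where(np.any(edited != prompt, axis=1))[0]
    print(changed_rows)
```

In this sketch the scale plays the role of a continuous control knob: varying it moves the subject token along the attribute direction while every other token embedding stays unchanged, and several attributes combine by summing their scaled directions.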
