PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control

Rishubh Parihar, Sachidanand VS, Sabariswaran Mani, Tejan Karmali, R. Venkatesh Babu

TL;DR

This work tackles faithful inversion and fine-grained facial attribute control in text-to-image diffusion by conditioning Stable Diffusion on StyleGAN2's $\mathcal{W+}$ latent space through a latent adaptor $\mathcal{M}$ that maps $w \in \mathcal{W+}$ to time-dependent token embeddings $(v_t^1, v_t^2)$. A two-stage training regime (pretraining $\mathcal{M}$ on $(I,w)$ with diffusion and identity losses, followed by subject-specific LoRA fine-tuning) enables strong identity preservation while maintaining text editability; inference uses delayed identity injection with a threshold $\tau$ and attribute edits via global latent directions $d$ with strength $\beta$, enabling continuous control. The system supports multi-subject composition by chaining diffusion processes and fusing outputs with instance masks to avoid attribute mixing. Empirically, it achieves a favorable balance between prompt similarity and identity preservation, provides high-quality fine-grained edits, and extends to in-the-wild and stylized face images, albeit with limitations in encoder inversion accuracy and multi-subject efficiency.
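The two inference-time controls named above, the latent edit with direction $d$ and strength $\beta$ and the delayed identity injection with threshold $\tau$, can be illustrated with a minimal sketch. It is not the authors' implementation: the 18x512 layout of $\mathcal{W+}$, the function names, and the convention that $\tau$ is a fraction of the denoising steps (with step 0 the noisiest) are all assumptions made for illustration.

```python
import numpy as np

# Assumption: W+ for a 1024x1024 StyleGAN2 is 18 style vectors of 512 dims each.
W_PLUS_SHAPE = (18, 512)


def edit_latent(w, d, beta):
    """Shift a latent code along a global attribute direction: w* = w + beta * d.

    Varying beta continuously varies the strength of the edit.
    """
    return w + beta * d


def inject_identity(step_index, num_steps, tau):
    """Hypothetical schedule for delayed identity injection.

    Assumes step_index counts denoising steps from 0 (noisiest) and that the
    identity tokens are only used once a fraction tau of the steps has passed.
    """
    return step_index >= tau * num_steps


rng = np.random.default_rng(0)
w = rng.normal(size=W_PLUS_SHAPE)   # stand-in for an encoder output
d = rng.normal(size=W_PLUS_SHAPE)   # stand-in for a global edit direction
for beta in (0.0, 0.5, 1.0):
    w_star = edit_latent(w, d, beta)
    print(beta, float(np.linalg.norm(w_star - w)))
```

Under these assumptions, a larger $\tau$ leaves more of the early denoising steps to the text prompt alone, which is the trade-off the delayed injection is meant to control.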

Abstract

Recently, we have seen a surge of personalization methods for text-to-image (T2I) diffusion models that learn a concept from a few images. Existing approaches, when used for face personalization, struggle to achieve convincing inversion with identity preservation and rely on semantic text-based editing of the generated face. However, finer-grained control is desired for facial attribute editing, which is challenging to achieve solely with text prompts. In contrast, StyleGAN models learn a rich face prior and enable smooth, fine-grained attribute editing through latent manipulation. This work uses the disentangled $\mathcal{W+}$ space of StyleGANs to condition the T2I model. This approach allows us to precisely manipulate facial attributes, such as smoothly introducing a smile, while preserving the existing coarse text-based control inherent in T2I models. To enable conditioning of the T2I model on the $\mathcal{W+}$ space, we train a latent mapper to translate latent codes from $\mathcal{W+}$ to the token embedding space of the T2I model. The proposed approach excels in the precise inversion of face images with attribute preservation and facilitates continuous control for fine-grained attribute editing. Furthermore, our approach can be readily extended to generate compositions involving multiple individuals. We perform extensive experiments to validate our method for face personalization and fine-grained attribute editing.
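To make the role of the latent mapper concrete, the following toy sketch maps a latent code $w$ and a timestep $t$ to a pair of token embeddings. Everything about the architecture is an assumption chosen for illustration: the two-layer MLP, the sinusoidal timestep embedding, the hidden width, the 18x512 latent shape, and the 768-dimensional token size (the size used by Stable Diffusion v1 text encoders). The weights are random, so the outputs carry no meaning; only the shapes and the data flow are shown.

```python
import numpy as np


def timestep_embedding(t, dim=64):
    """Sinusoidal embedding of a scalar timestep (a common choice, assumed here)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])


class ToyLatentAdaptor:
    """Hypothetical stand-in for the mapper: (w, t) -> (v_t^1, v_t^2)."""

    def __init__(self, w_shape=(18, 512), t_dim=64, hidden=256,
                 token_dim=768, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = w_shape[0] * w_shape[1] + t_dim
        self.t_dim = t_dim
        self.token_dim = token_dim
        # Untrained random weights; a real mapper would be learned.
        self.W1 = rng.normal(0.0, 0.02, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.02, (hidden, 2 * token_dim))
        self.b2 = np.zeros(2 * token_dim)

    def __call__(self, w, t):
        x = np.concatenate([w.reshape(-1), timestep_embedding(t, self.t_dim)])
        h = np.maximum(x @ self.W1 + self.b1, 0.0)   # ReLU hidden layer
        out = h @ self.W2 + self.b2
        # Split the output into the two token embeddings.
        return out[:self.token_dim], out[self.token_dim:]


adaptor = ToyLatentAdaptor()
w = np.random.default_rng(1).normal(size=(18, 512))
v1, v2 = adaptor(w, t=500)
print(v1.shape, v2.shape)
```

In a full pipeline, the two vectors would replace the embeddings of two placeholder tokens in the prompt before the text encoder runs; that wiring is omitted here.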

Paper Structure (16 sections, 1 equation, 14 figures, 1 table)

This paper contains 16 sections, 1 equation, 14 figures, 1 table.

Figures (14)

  • Figure 1: Given a single portrait image, we embed the subject into a text-to-image diffusion model for personalized image generation. The embedded subject can then be transformed or placed in a novel context using text conditioning. The proposed method can also compose multiple learned subjects with high fidelity and identity preservation. To obtain a precise inversion of the face, we condition the T2I model on the rich $\mathcal{W+}$ latent space of StyleGAN2. This additionally enables fine-grained, continuous control over facial attributes of the generated face, such as age and beard.
  • Figure 2: Framework for personalization. Given a single portrait image, we extract its $w$ latent representation from encoder $\mathcal{E}_{GAN}$. The latent $w$ along with diffusion timestep $t$ are passed through the latent adaptor $\mathcal{M}$ to generate a pair of time-dependent token embeddings $(v_t^1, v_t^2)$ representing the input subject. Finally, the token embeddings are combined with arbitrary prompts to generate customized images.
  • Figure 3: Delayed identity injection results in better text editability.
  • Figure 4: Fine-grained attribute editing. We map the given input image to a latent code $w$, which is shifted along a global linear attribute edit direction to obtain the edited latent code $w^*$. The edited latent code $w^*$ is then passed through the T2I model to obtain fine-grained attribute edits. The scalar edit strength parameter $\beta$ can be varied to obtain continuous attribute control.
  • Figure 5: Composing multiple persons without finetuning results in identity distortion. Finetuning a single model for both identities results in attribute mixing: the age and facial hair from v1 are transferred to v2. Combining the outputs of individually finetuned models results in excellent identity preservation without attribute mixing.
  • ...and 9 more figures
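
The mask-based combination described for Figure 5 can be sketched as a per-pixel blend. This is only an illustration under assumptions: the function name, the array layout (height, width, channels), and the use of binary instance masks pasted over a shared background are hypothetical and are not taken from the paper.

```python
import numpy as np


def fuse_with_masks(background, subject_outputs, masks):
    """Paste each subject's output into its own masked region.

    background:      array of shape (H, W, C)
    subject_outputs: list of arrays of shape (H, W, C), one per subject model
    masks:           list of arrays of shape (H, W) with values in [0, 1]
    """
    fused = background.astype(float).copy()
    for out, mask in zip(subject_outputs, masks):
        m = mask[..., None].astype(float)   # broadcast the mask over channels
        fused = m * out + (1.0 - m) * fused
    return fused


H, W, C = 4, 4, 3
background = np.zeros((H, W, C))
out_v1 = np.full((H, W, C), 1.0)   # stand-in for the model finetuned on subject v1
out_v2 = np.full((H, W, C), 2.0)   # stand-in for the model finetuned on subject v2
mask_v1 = np.zeros((H, W)); mask_v1[:, :2] = 1.0   # left half belongs to v1
mask_v2 = np.zeros((H, W)); mask_v2[:, 2:] = 1.0   # right half belongs to v2

fused = fuse_with_masks(background, [out_v1, out_v2], [mask_v1, mask_v2])
print(fused[0, :, 0])
```

Because each region is filled only from the model finetuned on that subject, attributes of one subject cannot leak into the other's region in this simplified picture.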