Zero-shot Face Editing via ID-Attribute Decoupled Inversion

Yang Hou, Minggu Wang, Jianjun Zhao

TL;DR

This paper addresses the challenge of preserving identity (ID) and structural fidelity when editing real faces with diffusion-based methods. It proposes ID-Attribute Decoupled Inversion, which represents identity with a whole-face embedding and attributes with text embeddings, and uses the two as joint conditions to guide both inversion and reverse diffusion. A 69,900-pair face-attribute dataset and LoRA-based fine-tuning of Stable Diffusion enable zero-shot editing from text prompts alone, without region masks. Experiments on FFHQ and CelebA-HQ show better ID preservation, structural consistency, and editing quality than state-of-the-art baselines, at an editing speed comparable to DDIM inversion.

Abstract

Recent advancements in text-guided diffusion models have shown promise for general image editing via inversion techniques, but often struggle to maintain ID and structural consistency in real face editing tasks. To address this limitation, we propose a zero-shot face editing method based on ID-Attribute Decoupled Inversion. Specifically, we decompose the face representation into ID and attribute features, using them as joint conditions to guide both the inversion and the reverse diffusion processes. This allows independent control over ID and attributes, ensuring strong ID preservation and structural consistency while enabling precise facial attribute manipulation. Our method supports a wide range of complex multi-attribute face editing tasks using only text prompts, without requiring region-specific input, and operates at a speed comparable to DDIM inversion. Comprehensive experiments demonstrate its practicality and effectiveness.

Paper Structure

This paper contains 7 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: In each pair of images, the left shows the original input image with its corresponding text description displayed below. The right shows the edited image, with the modified text description displayed below it. The face image is edited according to the modified text description. (Zoom in to see details.)
  • Figure 2: The left diagram illustrates a $T$-step DDIM inversion and reverse diffusion process, where $z_{T} \rightarrow z_{0}$ represents the ideal reverse diffusion denoising trajectory. $z_{0} \rightarrow z^*_{T}$ denotes the DDIM inversion trajectory guided by the text condition $P$, yielding $z^*_{T}$ as an approximation of $z_{T}$. $z^*_T \rightarrow z'_0$ is the reconstruction trajectory under the guidance of condition $P$, while $z^*_T \rightarrow z''_0$ represents the reverse diffusion trajectory guided by a new condition $P_n$, resulting in $z''_0$ deviating significantly from $z_0$. The middle diagram illustrates existing inversion-based image editing methods, which typically use the reconstruction trajectory $z^*_T \rightarrow z'_0$ as a reference to optimize $z^*_T \rightarrow z''_0$. The right diagram illustrates our method, which uses both ID features and facial attributes as joint conditions to guide the inversion and reverse diffusion processes. Under the guidance of these two conditions, the inversion yields a synthesized $z^*_T$ that is closer to the ideal initial latent code $z_{T}$. The reverse diffusion process then starts from $z^*_T$ and results in the synthesized output $z''_0$, which is pulled towards $z_0$ under the constraint of the input conditions.
  • Figure 3: Comparison of reconstruction results between our method and text-guided DDIM inversion.
  • Figure 4: Comparison of different methods on single-attribute editing tasks. Each row corresponds to a different attribute editing task. It can be seen that our method outperforms existing approaches in terms of editing accuracy, as well as ID and structural consistency. (Zoom in to see details.)
  • Figure 5: Comparison of different methods on multi-attribute editing tasks. It can be seen that our method still achieves high-quality editing results in multi-attribute editing tasks, maintaining both ID and structural consistency.
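
The trajectories described in the Figure 2 caption can be illustrated with a minimal sketch of deterministic DDIM inversion followed by reverse diffusion. This is not the authors' implementation: `toy_eps` is a hypothetical stand-in for the conditional noise predictor, the `alphas` schedule and the condition vectors are made-up values, and the ID-attribute conditioning of the paper is not modeled. Because the toy predictor ignores the latent and the timestep, reconstruction under the original condition is exact here, whereas a real network would only reconstruct approximately.

```python
import math

def ddim_step(z, eps, a_from, a_to):
    """Deterministic DDIM update from cumulative alpha a_from to a_to."""
    # Predict the clean latent implied by the current latent and noise estimate.
    x0 = [(zi - math.sqrt(1.0 - a_from) * ei) / math.sqrt(a_from)
          for zi, ei in zip(z, eps)]
    # Re-noise the prediction to the target noise level with the same estimate.
    return [math.sqrt(a_to) * xi + math.sqrt(1.0 - a_to) * ei
            for xi, ei in zip(x0, eps)]

def toy_eps(z, t, cond):
    """Hypothetical noise predictor: depends only on the condition (assumption)."""
    return [math.tanh(c) for c in cond]

def invert(z0, cond, alphas):
    """DDIM inversion: walk the latent from step 0 up to step T."""
    z = z0
    for t in range(len(alphas) - 1):
        z = ddim_step(z, toy_eps(z, t, cond), alphas[t], alphas[t + 1])
    return z

def denoise(z_T, cond, alphas):
    """Reverse diffusion: walk the latent from step T back down to step 0."""
    z = z_T
    for t in range(len(alphas) - 1, 0, -1):
        z = ddim_step(z, toy_eps(z, t, cond), alphas[t], alphas[t - 1])
    return z

# Illustrative values only: cumulative alphas from clean (1.0) to noisy (0.1).
alphas = [1.0, 0.9, 0.7, 0.4, 0.1]
z0 = [0.3, -1.2]          # toy "image" latent
cond_p = [0.5, -0.2]      # stands in for the original condition P
cond_n = [-0.5, 0.8]      # stands in for the new condition P_n

z_star_T = invert(z0, cond_p, alphas)
z_rec = denoise(z_star_T, cond_p, alphas)    # reconstruction trajectory
z_edit = denoise(z_star_T, cond_n, alphas)   # trajectory under the new condition
```

In this toy setting `z_rec` matches `z0` up to floating-point error, while `z_edit` drifts away from it. That drift under a changed condition is the deviation the caption attributes to plain text-guided inversion, and it is what reference-trajectory methods and the joint ID-attribute conditioning aim to reduce.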