DreamSalon: A Staged Diffusion Framework for Preserving Identity-Context in Editable Face Generation

Haonan Lin; Mengmeng Wang; Yan Chen; Wenbin An; Yuzhe Yao; Guang Dai; Qianying Wang; Yong Liu; Jingdong Wang

TL;DR

DreamSalon tackles "identity fine editing" in face images with a staged, noise-guided diffusion framework that separates aggressive editing from quality boosting. It uses high-frequency cues and the gradient of predicted noises to tell the editing stage from the boosting stage, and applies covariance-guided semantic mixing of source and target prompts to align the source identity with the target edits. The method offers fast per-identity personalization and detailed editing without extra encoders, and it is reported to outperform state-of-the-art baselines in both qualitative and quantitative evaluations. The work also gives a principled way to control how prompts are semantically integrated during diffusion-based image synthesis.

Abstract

While large-scale pre-trained text-to-image models can synthesize diverse and high-quality human-centered images, novel challenges arise with the nuanced task of "identity fine editing": precisely modifying specific features of a subject while maintaining its inherent identity and context. Existing personalization methods, which either require time-consuming optimization or learn additional encoders, are adept at "identity re-contextualization" but often struggle with detailed and sensitive tasks such as human face editing. To address these challenges, we introduce DreamSalon, a noise-guided, staged-editing framework that focuses on detailed image manipulations and identity-context preservation. By discerning editing and boosting stages via the frequency and gradient of predicted noises, DreamSalon first performs detailed manipulations on specific features in the editing stage, guided by high-frequency information, and then employs stochastic denoising in the boosting stage to improve image quality. For more precise editing, DreamSalon semantically mixes source and target textual prompts, guided by differences in their embedding covariances, to direct the model's focus to specific manipulation areas. Our experiments demonstrate DreamSalon's ability to efficiently and faithfully edit fine details on human faces, outperforming existing methods both qualitatively and quantitatively.
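
The stage discernment described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each denoising step has already been summarized by two hypothetical scalars, `hf_energy` (high-frequency content of the predicted noise) and `grad_mag` (magnitude of the change in predicted noise), and it borrows the 75% and 25% quantile thresholds mentioned in the Figure 3 caption. The function name `discern_stages` and the rule for turning thresholds into stage boundaries are assumptions made for illustration.

```python
from statistics import quantiles


def discern_stages(hf_energy, grad_mag):
    """Return (edit_end, boost_start) as indices into the denoising sequence.

    hf_energy[i]: hypothetical scalar score for the high-frequency content
                  of the predicted noise at denoising step i.
    grad_mag[i]:  hypothetical scalar magnitude of the change in predicted
                  noise at denoising step i.
    Both lists are ordered from the first denoising step to the last.
    """
    if len(hf_energy) != len(grad_mag) or len(hf_energy) < 2:
        raise ValueError("need two equally long sequences with at least 2 steps")

    # Quartile cut points; index 2 is the 75% quantile, index 0 the 25% quantile.
    hf_q75 = quantiles(hf_energy, n=4, method="inclusive")[2]
    grad_q25 = quantiles(grad_mag, n=4, method="inclusive")[0]

    # Assumption: the editing stage ends at the last step whose
    # high-frequency score is at or above the 75% quantile.
    edit_end = max(i for i, v in enumerate(hf_energy) if v >= hf_q75)

    # Assumption: the boosting stage starts at the first later step whose
    # gradient magnitude is at or below the 25% quantile.
    boost_start = len(grad_mag)  # sentinel: no boosting stage found
    for i in range(edit_end + 1, len(grad_mag)):
        if grad_mag[i] <= grad_q25:
            boost_start = i
            break

    return edit_end, boost_start
```

Steps between `edit_end` and `boost_start` belong to neither stage in this sketch; how the paper treats them is not stated in the abstract.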

Paper Structure (26 sections, 9 equations, 10 figures, 3 tables)

Introduction
Related Work
Text-to-Image Generation
Personalized Image Synthesis for Face Identity
Methods
Preliminary
Denoising Diffusion Implicit Model (DDIM)
Personalized Weights Generation
Staged Editing
Boosting Stage with Stochastic Denoising
Editing Stage with Frequency Guidance
Covariance Guidance for Detailed Editing
Overall Loss
Experiments
Experimental Settings
...and 11 more sections

Figures (10)

Figure 1: Unlike "identity re-contextualization" (DreamBooth; Ruiz et al., 2022), "identity fine editing" precisely manipulates details while preserving identity and context (DreamSalon).
Figure 2: DreamSalon pipeline. Phase 1: fast fine-tuning of a hypernetwork per identity to obtain personalization weights for the Latent Diffusion Model. Phase 2: noise-guided staged editing, where the aggressive-editing stage (before $t_{\text{edit}}$) and the quality-boosting stage (after $t_{\text{boost}}$) are discerned via predicted noises.
Figure 3: Stage discernment based on the frequency and gradient of predicted noises. The editing stage is determined by high-frequency predicted noises (75% quantile), followed by the boosting stage, where gradients are relatively smaller (25% quantile). More details about the frequency and gradient of predicted noises are available in the supplementary material.
Figure 4: Covariance analysis in prompt embeddings: differences between the covariance matrices of the source and target prompt embeddings guide the semantic mixing of prompts for precise attribute editing in generated images.
Figure 5: The FFE-Bench for fine-grained face editing across different views and challenging conditions, with DreamSalon's edits like attribute additions and expression changes.
...and 5 more figures
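
The covariance-guided prompt mixing summarized in the Figure 4 caption can be sketched as follows. This is an illustrative assumption, not the paper's formulation: prompt embeddings are taken to be small token-by-dimension lists of floats, the covariance is computed between token rows, and each token's mixing weight is the normalized size of the change in its covariance row. The names `token_covariance` and `mix_prompts` are hypothetical.

```python
def token_covariance(emb):
    """Covariance between token rows of a (tokens x dims) embedding.

    Returns a tokens x tokens matrix as nested lists. Requires dims >= 2.
    """
    dims = len(emb[0])
    centered = [[v - sum(row) / dims for v in row] for row in emb]
    return [
        [sum(a * b for a, b in zip(r1, r2)) / (dims - 1) for r2 in centered]
        for r1 in centered
    ]


def mix_prompts(src, tgt):
    """Blend source and target embeddings token by token.

    Assumption: a token whose covariance row changes more between source
    and target is more relevant to the edit, so it takes more of the
    target embedding. Both prompts must have the same shape.
    """
    cov_src = token_covariance(src)
    cov_tgt = token_covariance(tgt)

    # Per-token score: total absolute change in that token's covariance row.
    scores = [
        sum(abs(a - b) for a, b in zip(row_s, row_t))
        for row_s, row_t in zip(cov_src, cov_tgt)
    ]
    top = max(scores)
    weights = [s / top if top > 0 else 0.0 for s in scores]

    # Weight 0 keeps the source token, weight 1 takes the target token.
    mixed = [
        [(1 - w) * a + w * b for a, b in zip(row_s, row_t)]
        for w, row_s, row_t in zip(weights, src, tgt)
    ]
    return weights, mixed
```

In this sketch, tokens shared by both prompts get a weight near zero and stay close to the source, which mirrors the stated goal of changing only the edited attribute while preserving identity and context.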
