High-fidelity Person-centric Subject-to-Image Synthesis

Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin

TL;DR

Face-diffuser addresses the irreconcilable training imbalance and fidelity compromises in current subject-driven person-centric generation. It decouples scene and subject generation into two specialized diffusion models (TDM for semantic scenes and SDM for subjects) and introduces Saliency-adaptive Noise Fusion (SNF) to coordinate them during a three-stage sampling process at test time. The approach enables high-fidelity, inference-time personalization for unseen subjects without subject-specific fine-tuning, demonstrated through extensive quantitative and qualitative comparisons against state-of-the-art methods. Key contributions include the independent training of two diffusion models, the novel SNF collaboration mechanism, and a three-stage sampling pipeline (semantic scene construction, subject-scene fusion, and subject enhancement) with strong empirical results on single- and multi-subject generation. This work advances practical, scalable, and high-quality subject-to-image synthesis with potential implications for entertainment, AR/VR, and visual content creation, while acknowledging privacy and ethical considerations associated with realistic face generation.

Abstract

Current subject-driven image generation methods encounter significant challenges in person-centric image generation. The reason is that they learn semantic scene and person generation by fine-tuning a common pre-trained diffusion model, which involves an irreconcilable training imbalance. Specifically, to generate realistic persons, they need to sufficiently tune the pre-trained model, which inevitably causes the model to forget the rich semantic scene prior and makes scene generation over-fit to the training data. Moreover, even with sufficient fine-tuning, these methods still cannot generate high-fidelity persons, since joint learning of scene and person generation also leads to a quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline to eliminate the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e., Text-driven Diffusion Model (TDM) and Subject-augmented Diffusion Model (SDM), for scene and person generation, respectively. The sampling process is divided into three sequential stages, i.e., semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM, respectively. The subject-scene fusion stage is a collaboration between the two models, achieved through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). It is based on our key observation that there exists a robust link between classifier-free guidance responses and the saliency of generated images. At each time step, SNF leverages the unique strengths of each model and automatically blends the predicted noises from both models spatially in a saliency-aware manner. Extensive experiments confirm the impressive effectiveness and robustness of Face-diffuser.
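
The abstract describes SNF only at a high level, so the following is a minimal NumPy sketch of one way such a fusion step could look, not the authors' implementation. It assumes that the classifier-free guidance response is the conditional minus the unconditional noise prediction, that saliency is the channel-averaged absolute response, that each pixel takes the response of whichever model is more salient there, and that TDM's unconditional prediction serves as the baseline. The function names, the guidance scale, and the stage boundaries (0.2 and 0.8) are hypothetical.

```python
import numpy as np

def saliency(response):
    """Per-pixel saliency of a guidance response with shape (C, H, W).
    Assumption: channel-averaged absolute value."""
    return np.abs(response).mean(axis=0)

def fuse_noise(tdm_cond, tdm_uncond, sdm_cond, sdm_uncond, scale=5.0):
    """Blend the guidance responses of the two models pixel by pixel.

    Returns the fused noise prediction and the boolean (H, W) mask that is
    True where the SDM response was used.
    """
    # Classifier-free guidance responses of each model.
    r_tdm = tdm_cond - tdm_uncond
    r_sdm = sdm_cond - sdm_uncond
    # Assumption: pick, per pixel, the model whose response is more salient.
    use_sdm = saliency(r_sdm) > saliency(r_tdm)
    fused_response = np.where(use_sdm[None, :, :], r_sdm, r_tdm)
    # Assumption: TDM's unconditional prediction is the guidance baseline.
    return tdm_uncond + scale * fused_response, use_sdm

def stage_for_step(step, total_steps, fusion_start=0.2, fusion_end=0.8):
    """Map a sampling step (0 = first, noisiest step) to one of the three stages.
    The boundary fractions are hypothetical placeholders."""
    frac = step / total_steps
    if frac < fusion_start:
        return "tdm"   # semantic scene construction
    if frac < fusion_end:
        return "snf"   # subject-scene fusion
    return "sdm"       # subject enhancement
```

In a sampler built on this sketch, `stage_for_step` would decide at each step whether the noise comes from TDM alone, from `fuse_noise`, or from SDM alone.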

Paper Structure (28 sections, 7 equations, 15 figures, 4 tables, 1 algorithm)

Figures (15)

  • Figure 1: Displayed are the results generated using our Face-diffuser, showcasing its prowess across varied inputs. Each instance comprises two distinct inputs: a textual description, and reference images.
  • Figure 2: Current methods jointly learn the generation of semantic scenes and persons, which compromises the quality of person generation (left), and the irreconcilable training imbalance leads to catastrophic forgetting of the semantic scene prior (right).
  • Figure 3: Experimental results showing the irreconcilable training imbalance between semantic scene and person generation in Fastcomposer [fastcomposer] and Subject-diffusion [subject-diffusion]. We partitioned the FFHQ-wild [fastcomposer] dataset into training and test sets at a 6:1 ratio and assessed identity preservation and prompt consistency throughout training.
  • Figure 4: An overview of the Face-diffuser framework. On the left, we display the architectures of the two pre-trained models, derived from Stable Diffusion [ldm], omitting the autoencoder for simplicity. On the right, we outline our sampling process, which consists of three well-designed stages.
  • Figure 5: Qualitative comparative results against state-of-the-art methods on single-subject generation.
  • ...and 10 more figures