High-fidelity Person-centric Subject-to-Image Synthesis
Yibin Wang, Weizhong Zhang, Jianwei Zheng, Cheng Jin
TL;DR
Face-diffuser addresses the irreconcilable training imbalance and fidelity compromises in current subject-driven, person-centric generation by decoupling scene and subject generation into two specialized diffusion models (TDM for semantic scenes and SDM for subjects) and introducing Saliency-adaptive Noise Fusion (SNF) to coordinate them during three-stage, test-time sampling. The approach enables high-fidelity, inference-time personalization for unseen subjects without subject-specific fine-tuning, as shown in extensive quantitative and qualitative comparisons against state-of-the-art methods. Key contributions include the independent training of two diffusion models, the novel SNF collaboration mechanism, and a three-stage sampling pipeline (semantic scene construction, subject-scene fusion, and subject enhancement) with strong empirical results on single- and multi-subject generation. This work advances practical, scalable, and high-quality subject-to-image synthesis with potential implications for entertainment, AR/VR, and visual content creation, while acknowledging the privacy and ethical considerations associated with realistic face generation.
Abstract
Current subject-driven image generation methods encounter significant challenges in person-centric image generation. The reason is that they learn semantic scene and person generation by fine-tuning a common pre-trained diffusion model, which involves an irreconcilable training imbalance. Precisely, to generate realistic persons, they need to sufficiently tune the pre-trained model, which inevitably causes the model to forget the rich semantic scene prior and makes scene generation over-fit to the training data. Moreover, even with sufficient fine-tuning, these methods still cannot generate high-fidelity persons, since joint learning of scene and person generation also leads to a quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline to eliminate the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e., Text-driven Diffusion Model (TDM) and Subject-augmented Diffusion Model (SDM), for scene and person generation, respectively. The sampling process is divided into three sequential stages, i.e., semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM, respectively. In the subject-scene fusion stage, the two models collaborate through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). It is based on our key observation that there exists a robust link between classifier-free guidance responses and the saliency of generated images. At each time step, SNF leverages the unique strengths of each model and automatically blends the predicted noises from both models spatially in a saliency-aware manner. Extensive experiments confirm the impressive effectiveness and robustness of Face-diffuser.
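As a rough illustration of the sampling idea described above, the following NumPy sketch shows one way a saliency-aware blend of two models' noise predictions could look. It is not the authors' implementation. The abstract does not specify how saliency is computed, so the sketch assumes it is the magnitude of the classifier-free guidance response (conditional minus unconditional noise), normalized with a softmax over pixels. The function names, the guidance scale, and the stage boundaries `alpha` and `beta` are all hypothetical.

```python
import numpy as np


def stage_for_step(step, total_steps, alpha=0.2, beta=0.8):
    """Pick which predictor drives a sampling step.

    alpha and beta are hypothetical stage boundaries (fractions of the
    schedule); the abstract only states that there are three stages.
    """
    frac = step / total_steps
    if frac < alpha:
        return "tdm"   # semantic scene construction
    if frac < beta:
        return "snf"   # subject-scene fusion
    return "sdm"       # subject enhancement


def cfg(eps_cond, eps_uncond, scale):
    """Standard classifier-free guidance combination."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def saliency(eps_cond, eps_uncond):
    """Assumed saliency proxy: magnitude of the guidance response."""
    response = np.abs(eps_cond - eps_uncond)   # shape (C, H, W)
    sal = response.mean(axis=0)                # collapse channels to (H, W)
    # Softmax over all pixels puts both models' maps on a comparable scale.
    sal = np.exp(sal - sal.max())
    return sal / sal.sum()


def snf_fuse(tdm_cond, tdm_uncond, sdm_cond, sdm_uncond, scale=7.5):
    """Blend the two guided noise predictions pixel by pixel."""
    sal_t = saliency(tdm_cond, tdm_uncond)
    sal_s = saliency(sdm_cond, sdm_uncond)
    # mask is 1 where the scene model (TDM) responds more strongly.
    mask = (sal_t >= sal_s).astype(tdm_cond.dtype)
    eps_t = cfg(tdm_cond, tdm_uncond, scale)
    eps_s = cfg(sdm_cond, sdm_uncond, scale)
    # The (H, W) mask broadcasts over the channel axis of (C, H, W).
    fused = mask * eps_t + (1.0 - mask) * eps_s
    return fused, mask
```

In this sketch each pixel takes its noise from whichever model reacts more strongly to its own conditioning there, so the hard mask keeps the two predictions from being averaged into a compromise. A real pipeline would likely smooth the saliency maps before comparing them; that step is omitted here for brevity.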
