Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation
Renshuai Liu, Bowen Ma, Wei Zhang, Zhipeng Hu, Changjie Fan, Tangjie Lv, Yu Ding, Xuan Cheng
TL;DR
The paper tackles personalized face generation with simultaneous control over identity and fine-grained expression within a specified background. It introduces a diffusion-based framework, DiffSFSR, that takes a background prompt, a user selfie, and a fine-grained expression label (EmoFace135) and performs face swapping and reenactment at the same time, with explicit background conditioning. Key contributions include a balanced identity/expression encoding scheme with compound embeddings, an explicit background conditioning mechanism, and an improved midpoint sampling method, all implemented on a latent-diffusion backbone. Experiments report high identity and expression fidelity, better performance than text-to-image and existing face manipulation baselines, and favorable user-study results, suggesting practical potential for controllable, personalized portrait generation. The work advances multi-modal conditioning for diffusion models and broadens the expressive range of identity-preserving face synthesis.
Abstract
In human-centric content generation, pre-trained text-to-image models struggle to produce the portrait images users want: images that retain an individual's identity while exhibiting diverse expressions. This paper presents our efforts towards personalized face generation. We propose a novel multi-modal face generation framework capable of simultaneous identity-expression control and more fine-grained expression synthesis. The expression control is fine enough to be specified by a fine-grained emotional vocabulary. We devise a novel diffusion model that performs face swapping and reenactment simultaneously. Because identity and expression are entangled, controlling them separately and precisely within one framework is nontrivial and has not yet been explored. To overcome this, we propose several innovative designs in the conditional diffusion model, including an encoder that balances identity and expression, improved midpoint sampling, and explicit background conditioning. Extensive experiments demonstrate the controllability and scalability of the proposed framework in comparison with state-of-the-art text-to-image, face swapping, and face reenactment methods.
