ID-Consistent, Precise Expression Generation with Blendshape-Guided Diffusion
Foivos Paraperas Papantoniou, Stefanos Zafeiriou
TL;DR
This work tackles ID-consistent, fine-grained facial expression generation. It extends a diffusion backbone (Arc2Face) with an Expression Adapter that injects explicit FLAME blendshape parameters into the CLIP latent space via a dual-attention mechanism, enabling precise, disentangled control over expressions without degrading identity. It further introduces a Reference Adapter for image-based expression editing, which uses a frozen reference UNet and LoRA modulation to preserve appearance and background while enabling expression transfer. Training draws on expression-rich datasets and cross-paired video data. The results show better expression fidelity and identity preservation than state-of-the-art baselines in both identity-driven and reference-driven settings, supported by objective metrics and a user study. The work contributes a practical, open-source framework for controllable, high-fidelity face synthesis with potential applications in storytelling, FER research, and synthetic data generation, and it acknowledges ethical concerns around detecting synthetic facial content and the risk of misuse.
Abstract
Human-centric generative models designed for AI-driven storytelling must bring together two core capabilities: identity consistency and precise control over human performance. While recent diffusion-based approaches have made significant progress in maintaining facial identity, achieving fine-grained expression control without compromising identity remains challenging. In this work, we present a diffusion-based framework that faithfully reimagines any subject under any particular facial expression. Building on an ID-consistent face foundation model, we adopt a compositional design featuring an expression cross-attention module guided by FLAME blendshape parameters for explicit control. Trained on a diverse mixture of image and video data rich in expressive variation, our adapter generalizes beyond basic emotions to subtle micro-expressions and expressive transitions, overlooked by prior works. In addition, a pluggable Reference Adapter enables expression editing in real images by transferring the appearance from a reference frame during synthesis. Extensive quantitative and qualitative evaluations show that our model outperforms existing methods in tailored and identity-consistent expression generation. Code and models can be found at https://github.com/foivospar/Arc2Face.
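To make the idea of an expression cross-attention branch concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a decoupled design in which the image latent attends separately to identity tokens and to tokens projected from a blendshape vector, and the two results are summed. All sizes, weight names, the bias-free projection, the `expr_scale` parameter, and the blendshape values are hypothetical, and plain Python lists stand in for the tensors a real model would use.

```python
import math
import random

# Hypothetical toy sizes; the real model's dimensions are not given in the text above.
N_LATENT, N_ID, N_EXPR, DIM, N_BLEND = 4, 3, 2, 8, 5


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s * scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)


def random_matrix(rows, cols, rng):
    """Random Gaussian matrix standing in for learned weights or features."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def make_weights(rng):
    """Create random stand-ins for the projection matrices of both branches."""
    names = ("q", "k_id", "v_id", "k_expr", "v_expr")
    weights = {name: random_matrix(DIM, DIM, rng) for name in names}
    # One bias-free linear map per expression token (an assumed design).
    weights["proj"] = [random_matrix(N_BLEND, DIM, rng) for _ in range(N_EXPR)]
    return weights


def dual_cross_attention(latent, id_tokens, blendshapes, w, expr_scale=1.0):
    """Sum an identity attention branch and a blendshape-driven expression branch."""
    # Map the blendshape vector to a few expression tokens.
    expr_tokens = [matmul([blendshapes], p)[0] for p in w["proj"]]
    q = matmul(latent, w["q"])  # queries are shared by both branches
    id_out = attention(q, matmul(id_tokens, w["k_id"]), matmul(id_tokens, w["v_id"]))
    expr_out = attention(q, matmul(expr_tokens, w["k_expr"]), matmul(expr_tokens, w["v_expr"]))
    return [[a + expr_scale * b for a, b in zip(ra, rb)] for ra, rb in zip(id_out, expr_out)]


rng = random.Random(0)
weights = make_weights(rng)
latent = random_matrix(N_LATENT, DIM, rng)
id_tokens = random_matrix(N_ID, DIM, rng)
smile = [0.8, 0.1, 0.0, -0.3, 0.5]  # made-up blendshape coefficients
out = dual_cross_attention(latent, id_tokens, smile, weights)
print(len(out), len(out[0]))  # prints "4 8"
```

In this sketch the identity branch is untouched by the expression input, so setting `expr_scale=0.0` recovers identity-only conditioning. Because the toy projection has no bias, an all-zero blendshape vector also contributes nothing. Both are properties of the sketch and should not be read as claims about the released model.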
