ID-Consistent, Precise Expression Generation with Blendshape-Guided Diffusion

Foivos Paraperas Papantoniou, Stefanos Zafeiriou

TL;DR

This work tackles ID-consistent, fine-grained facial expression generation by extending a diffusion backbone (Arc2Face) with an Expression Adapter. The adapter injects explicit FLAME blendshape parameters into the CLIP latent space via a dual-attention mechanism, enabling precise, disentangled control over expressions without degrading identity. The work further introduces a Reference Adapter for image-based expression editing, which uses a frozen reference UNet and LoRA modulation to preserve appearance and background while the expression is transferred. Training uses expression-rich datasets and cross-paired video data. The results show superior expression fidelity and identity preservation compared to state-of-the-art baselines in both identity-driven and reference-driven settings, supported by objective metrics and a user study. The work contributes a practical, open-source framework for controllable, high-fidelity face synthesis with potential applications in storytelling, facial expression recognition (FER) research, and synthetic data generation, and it acknowledges ethical concerns around detecting synthetic facial content and the risk of misuse.
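The LoRA modulation mentioned in the TL;DR can be illustrated with a minimal NumPy sketch of the generic low-rank update, W + (alpha / r) B A, applied on top of a frozen projection. This is not the authors' implementation: the function name `lora_linear`, the toy dimensions, and the `alpha / r` scaling convention are assumptions made for illustration.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=1.0):
    """Apply a frozen weight W plus a low-rank update (alpha / r) * B @ A.

    x: (batch, d_in), W: (d_out, d_in), A: (r, d_in), B: (d_out, r).
    All names and shapes are hypothetical, chosen for illustration only.
    """
    r = A.shape[0]
    W_adapted = W + (alpha / r) * (B @ A)
    return x @ W_adapted.T

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 8, 2          # toy sizes (assumption)
W = rng.normal(size=(d_out, d_in))   # frozen base projection
A = rng.normal(size=(rank, d_in))    # trainable down-projection
B = np.zeros((d_out, rank))          # trainable up-projection, zero-initialised
x = rng.normal(size=(4, d_in))

# With B initialised to zero the adapted layer reproduces the frozen layer,
# so training starts from the behaviour of the base model.
y = lora_linear(x, W, A, B)
```

Zero-initialising `B` is the usual LoRA convention: the added weights initially leave the base layer unchanged and only start to modulate it as they are trained.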

Abstract

Human-centric generative models designed for AI-driven storytelling must bring together two core capabilities: identity consistency and precise control over human performance. While recent diffusion-based approaches have made significant progress in maintaining facial identity, achieving fine-grained expression control without compromising identity remains challenging. In this work, we present a diffusion-based framework that faithfully reimagines any subject under any particular facial expression. Building on an ID-consistent face foundation model, we adopt a compositional design featuring an expression cross-attention module guided by FLAME blendshape parameters for explicit control. Trained on a diverse mixture of image and video data rich in expressive variation, our adapter generalizes beyond basic emotions to subtle micro-expressions and expressive transitions, overlooked by prior works. In addition, a pluggable Reference Adapter enables expression editing in real images by transferring the appearance from a reference frame during synthesis. Extensive quantitative and qualitative evaluations show that our model outperforms existing methods in tailored and identity-consistent expression generation. Code and models can be found at https://github.com/foivospar/Arc2Face.
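The expression cross-attention module described above can be sketched in NumPy under stated assumptions. The sketch maps a blendshape vector to a few tokens with a small MLP, then runs two cross-attention branches that share the queries but use separate key and value matrices for identity and expression. Summing the two branches (with a hypothetical `exp_scale` weight), the number of blendshape coefficients, the number of tokens, and all names and dimensions are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention for a single head."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def expression_tokens(blendshapes, W1, W2, n_tokens):
    """Hypothetical two-layer MLP mapping blendshape parameters to tokens."""
    hidden = np.maximum(blendshapes @ W1, 0.0)   # ReLU
    return (hidden @ W2).reshape(n_tokens, -1)

def dual_cross_attention(x, id_tokens, exp_tokens, p, exp_scale=1.0):
    """Shared queries, separate key/value projections per condition."""
    q = x @ p["Wq"]
    id_out = attend(q, id_tokens @ p["Wk_id"], id_tokens @ p["Wv_id"])
    exp_out = attend(q, exp_tokens @ p["Wk_exp"], exp_tokens @ p["Wv_exp"])
    return id_out + exp_scale * exp_out          # summation is an assumption

rng = np.random.default_rng(0)
d, n_exp, n_tokens = 16, 50, 4                   # toy sizes (assumption)
names = ("Wq", "Wk_id", "Wv_id", "Wk_exp", "Wv_exp")
p = {name: rng.normal(scale=0.1, size=(d, d)) for name in names}
W1 = rng.normal(scale=0.1, size=(n_exp, 32))
W2 = rng.normal(scale=0.1, size=(32, n_tokens * d))

x = rng.normal(size=(64, d))                     # flattened UNet features
id_tokens = rng.normal(size=(5, d))              # stand-in identity tokens
blendshapes = rng.normal(size=(n_exp,))          # stand-in expression parameters
exp_tokens = expression_tokens(blendshapes, W1, W2, n_tokens)

out = dual_cross_attention(x, id_tokens, exp_tokens, p)
print(out.shape)  # (64, 16)
```

Keeping the identity branch untouched and adding the expression branch alongside it is one way such a design can leave the identity conditioning intact; setting `exp_scale=0` in this sketch recovers the identity-only output.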

Paper Structure

This paper contains 17 sections, 1 equation, 13 figures, 3 tables.

Figures (13)

  • Figure 1: We introduce a fine-grained expression adapter on a foundation ID-consistent face model, which significantly outperforms existing approaches in expression-transfer fidelity and can apply any type of facial expression to any given subject - including extreme or asymmetric ones - using explicit blendshape parameters (left). Our method can optionally integrate a reference adapter that enables expression editing without altering the appearance or background (right).
  • Figure 2: Overview of the proposed expression-control framework. Our approach builds on the Arc2Face diffusion model [paraperas2024arc2face], which conditions the denoising UNet on ID embeddings. (a) We introduce an Expression Adapter that guides the generation using explicit FLAME [FLAME:SiggraphAsia2017] blendshape parameters extracted from reference images via an off-the-shelf 3D reconstruction method [smirk_2024_CVPR]. The adapter consists of two components: (1) an MLP that maps 3DMM parameters into the CLIP latent space used by Stable Diffusion, and (2) a dual attention mechanism that integrates expression information alongside identity using separate key and value matrices in the cross-attention layers. (b) We further incorporate a Reference Adapter that conditions the model on the input image itself. A dedicated Reference UNet extracts multi-scale features, which are injected into the denoising UNet via self-attention layers modulated by added LoRA weights. By enabling this adapter at inference, we support image-based expression editing. The proposed modules are trained on expression-rich image and video data, achieving strong generalization across a wide range of facial expressions.
  • Figure 3: Visual comparison of our method with competing models [varanka2024fineface, liang2024caphuman] for expression-controlled generation conditioned on identity features.
  • Figure 4: Cosine similarity distribution between identity features of input and generated faces.
  • Figure 5: User preference regarding how accurately the generated expressions match the intended ones.
  • ...and 8 more figures
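
The identity comparison named in the Figure 4 caption is a cosine similarity between identity features of the input and generated faces. A minimal sketch of that computation follows; the random 512-dimensional vectors merely stand in for embeddings from a face recognition network, and the dimension and variable names are assumptions rather than details taken from the paper.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 512                                          # embedding size (assumption)
id_input = rng.normal(size=dim)                    # stand-in: input face features
id_same = id_input + 0.1 * rng.normal(size=dim)    # stand-in: well-preserved identity
id_other = rng.normal(size=dim)                    # stand-in: unrelated identity

sim_same = cosine_similarity(id_input, id_same)
sim_other = cosine_similarity(id_input, id_other)
print(sim_same > sim_other)  # True
```

Collecting such scores over many input/output pairs and plotting their histogram gives a distribution of the kind the caption describes, with higher values indicating better identity preservation.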