ControlFace: Harnessing Facial Parametric Control for Face Rigging

Wooseok Jang, Youngjun Hong, Geonho Cha, Seungryong Kim

TL;DR

ControlFace tackles face rigging by enabling precise, identity-preserving edits driven by explicit 3DMM renderings. It employs a dual-branch U-Net architecture (FaceNet for reference appearance and a denoising U-Net for generation) integrated via augmented self-attention, plus a control mixer module and reference control guidance (RCG) that couple the target and reference controls. Training on facial video data with quadruplets $\{X_{R}, X_{T}, D_{R}, D_{T}\}$ (reference image, target image, and their respective controls) avoids the bias of reconstruction-based training, where the model tends to ignore the target control, and leverages rich identity cues from FaceNet. Across qualitative, quantitative, and user studies, ControlFace achieves better control adherence, identity preservation, and image quality than the baselines, without per-identity fine-tuning. The approach combines 3DMM-based renderings, reference-grounded guidance, and dedicated conditioning modules.

Abstract

Manipulation of facial images to meet specific controls such as pose, expression, and lighting, also known as face rigging, is a complex task in computer vision. Existing methods are limited by their reliance on image datasets, which necessitates individual-specific fine-tuning and limits their ability to retain fine-grained identity and semantic details, reducing practical usability. To overcome these limitations, we introduce ControlFace, a novel face rigging method conditioned on 3DMM renderings that enables flexible, high-fidelity control. We employ dual-branch U-Nets: one, referred to as FaceNet, captures identity and fine details, while the other focuses on generation. To enhance control precision, the control mixer module encodes the correlated features between the target-aligned control and the reference-aligned control, and a novel guidance method, reference control guidance, steers the generation process for better control adherence. By training on a facial video dataset, we fully utilize FaceNet's rich representations while ensuring control adherence. Extensive experiments demonstrate ControlFace's superior performance in identity preservation and control precision, highlighting its practicality. Please see the project website: https://cvlab-kaist.github.io/ControlFace/.
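
The video-based training setup can be illustrated with a minimal sketch of how one training quadruplet $\{X_{R}, X_{T}, D_{R}, D_{T}\}$ might be assembled from a single clip. Everything named here is hypothetical and not taken from the paper's code: `Quadruplet`, `sample_quadruplet`, and `render_control` (a stand-in for whatever 3DMM fitting and rendering step produces a control image for a frame). Drawing two distinct frames is also an assumption.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class Quadruplet:
    x_ref: Any  # X_R: reference frame
    x_tgt: Any  # X_T: target frame
    d_ref: Any  # D_R: control aligned with the reference frame
    d_tgt: Any  # D_T: control aligned with the target frame


def sample_quadruplet(
    frames: Sequence[Any],
    render_control: Callable[[Any], Any],  # hypothetical 3DMM renderer
    rng: random.Random,
) -> Quadruplet:
    """Pick two frames from one video and pair each with its control."""
    if len(frames) < 2:
        raise ValueError("need at least two frames to form a pair")
    # Assumption: the two frames are distinct, so the target control
    # generally differs from the reference and cannot be ignored.
    i, j = rng.sample(range(len(frames)), 2)
    x_ref, x_tgt = frames[i], frames[j]
    return Quadruplet(x_ref, x_tgt, render_control(x_ref), render_control(x_tgt))


if __name__ == "__main__":
    video = ["frame_0", "frame_1", "frame_2", "frame_3"]
    quad = sample_quadruplet(video, lambda f: f"control({f})", random.Random(0))
    print(quad)
```

In a reconstruction setup the reference and target would be the same image, which is the case this pairing is meant to avoid.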

Paper Structure

This paper contains 42 sections, 7 equations, 17 figures, 10 tables.

Figures (17)

  • Figure 1: Our ControlFace can edit the input face image using explicit facial parametric controls, generating realistic images without compromising the identity and other semantic details such as hairstyle.
  • Figure 2: Limitations of reconstruction-based training. We compare the results of our model trained on an image dataset [karras2019style] in a reconstruction setup and on a video dataset [zhu2022celebvhq] with paired samples created by randomly selecting two frames from each video. Results from reconstruction-based training often ignore the target control at inference.
  • Figure 3: Overall Architecture. ControlFace encodes the reference image $X_{R}$ with FaceNet and a CLIP image encoder for identity and semantic preservation. For face control, the target control $D_{T}$ is incorporated into the denoising U-Net through the face controller. To enhance control adherence, the correlated feature between the reference control $D_{R}$ and the target control $D_{T}$ is obtained from the proposed control mixer module.
  • Figure 4: Visualization of Reference Control Guidance. We visualize the deltas $\epsilon_{\theta}(\cdot, D_{T}) - \epsilon_{\theta}(\cdot, \varnothing)$ and $\epsilon_{\theta}(\cdot, D_{T}) - \epsilon_{\theta}(\cdot, D_{R})$, which correspond to CFG [ho2022classifier] applied to the face controller input and to RCG, respectively, across different timesteps $t$. The first and third rows display RCG deltas, whereas the second and fourth rows show CFG deltas. The former show noisy deltas across all timesteps.
  • Figure 5: Qualitative Results. We compare the results of rigging pose, expression, and light with four different baselines [ghosh2020gif, liang2024caphuman, papantoniou2024arc2face, ding2023diffusionrig]. The reference images are from the FFHQ [karras2019style] dataset. Compared to the baselines, ControlFace aligns with the target control while maintaining the identity and details of the reference image.
  • ...and 12 more figures
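
The two deltas named in the Figure 4 caption can be written out as a small sketch. The two subtraction formulas come from the caption. The rest is assumed for illustration: the function names, the flat-list representation of a noise prediction, and in particular `apply_guidance`, which borrows the usual classifier-free-guidance form (base prediction plus a scaled delta) and is not confirmed as the paper's exact rule.

```python
from typing import List, Sequence


def _delta(eps_a: Sequence[float], eps_b: Sequence[float]) -> List[float]:
    """Element-wise difference of two noise predictions."""
    if len(eps_a) != len(eps_b):
        raise ValueError("predictions must have the same length")
    return [a - b for a, b in zip(eps_a, eps_b)]


def cfg_delta(eps_target: Sequence[float], eps_null: Sequence[float]) -> List[float]:
    """CFG delta: eps(., D_T) - eps(., null control)."""
    return _delta(eps_target, eps_null)


def rcg_delta(eps_target: Sequence[float], eps_reference: Sequence[float]) -> List[float]:
    """RCG delta: eps(., D_T) - eps(., D_R)."""
    return _delta(eps_target, eps_reference)


def apply_guidance(
    eps_base: Sequence[float], delta: Sequence[float], scale: float
) -> List[float]:
    """Hypothetical combination rule (assumption): eps_base + scale * delta."""
    if len(eps_base) != len(delta):
        raise ValueError("prediction and delta must have the same length")
    return [e + scale * d for e, d in zip(eps_base, delta)]


if __name__ == "__main__":
    # Toy values standing in for flattened U-Net outputs at one timestep.
    eps_target = [1.0, 2.0, 3.0]     # conditioned on D_T
    eps_reference = [1.0, 2.0, 1.0]  # conditioned on D_R
    eps_null = [0.5, 0.5, 0.5]       # control dropped

    print(cfg_delta(eps_target, eps_null))
    print(rcg_delta(eps_target, eps_reference))
    print(apply_guidance(eps_target, rcg_delta(eps_target, eps_reference), 2.0))
```

In this toy input the RCG delta is zero wherever the predictions under $D_{T}$ and $D_{R}$ agree, so only the entries that differ between the two controls are pushed by the guidance term.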