Neural Face Skinning for Mesh-agnostic Facial Expression Cloning

Sihun Cha, Serin Yoon, Kwanggyoon Seo, Junyong Noh

TL;DR

The paper tackles mesh-agnostic facial expression retargeting with interpretable control. It introduces a three-encoder, one-decoder network in which a skinning encoder predicts per-vertex region weights $\omega_{Skin}$. These weights localize the global expression code $z_{GE}$ into a per-vertex code $z_{LE} = \omega_{Skin} \odot z_{GE}$. By supervising the skinning weights with segmentation labels ($L_{nll}$) and aligning the expression code with FACS-based ICT blendshapes via $L_{BP}$ and $L_{BR}$, the model produces accurate, region-specific deformations while keeping the code intuitive to edit. Trained on ICT-Facekit and Multiface, it outperforms baselines in expression fidelity and inverse rigging, and it extends to unseen stylized meshes. The approach offers a practical path toward controllable, high-fidelity facial animation across diverse geometries.
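The localization step $z_{LE} = \omega_{Skin} \odot z_{GE}$ can be illustrated with a minimal NumPy sketch. The shapes are assumptions made for illustration only: the weights are taken to be a `(V, K)` array (one row per vertex, one column per expression-code component) and the global code a `(K,)` vector that is broadcast across vertices. The function name `localize_code` and the toy values are hypothetical and not taken from the paper.

```python
import numpy as np

def localize_code(w_skin, z_ge):
    """Element-wise product of per-vertex weights (V, K) with a global code (K,).

    Assumed shapes, for illustration only: the same global code is broadcast
    to every vertex and scaled by that vertex's weights.
    """
    w_skin = np.asarray(w_skin, dtype=float)
    z_ge = np.asarray(z_ge, dtype=float)
    if w_skin.ndim != 2 or z_ge.shape != (w_skin.shape[1],):
        raise ValueError("expected w_skin of shape (V, K) and z_ge of shape (K,)")
    return w_skin * z_ge[None, :]  # shape (V, K)

# Toy mesh with 3 vertices and a 2-component expression code (hypothetical values).
w_skin = np.array([[1.0, 0.0],   # vertex influenced only by component 0
                   [0.5, 0.5],   # vertex on a region boundary
                   [0.0, 1.0]])  # vertex influenced only by component 1
z_ge = np.array([0.8, 0.2])
z_le = localize_code(w_skin, z_ge)
```

In this toy case, a vertex whose weight for a component is zero receives none of that component's activation, which is the sense in which the weights restrict the influence of the global code to a region.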

Abstract

Accurately retargeting facial expressions to a face mesh while enabling manipulation is a key challenge in facial animation retargeting. Recent deep-learning methods address this by encoding facial expressions into a global latent code, but they often fail to capture fine-grained details in local regions. While some methods improve local accuracy by transferring deformations locally, this often complicates overall control of the facial expression. To address this, we propose a method that combines the strengths of both global and local deformation models. Our approach enables intuitive control and detailed expression cloning across diverse face meshes, regardless of their underlying structures. The core idea is to localize the influence of the global latent code on the target mesh. Our model learns to predict skinning weights for each vertex of the target face mesh through indirect supervision from predefined segmentation labels. These predicted weights localize the global latent code, enabling precise and region-specific deformations even for meshes with unseen shapes. We supervise the latent code using Facial Action Coding System (FACS)-based blendshapes to ensure interpretability and allow straightforward editing of the generated animation. Through extensive experiments, we demonstrate improved performance over state-of-the-art methods in terms of expression fidelity, deformation transfer accuracy, and adaptability across diverse mesh structures.
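As a rough illustration of supervising per-vertex predictions with segmentation labels, the sketch below computes a generic negative log-likelihood over vertices. It is not the paper's formulation: this page does not define $L_{nll}$, and the abstract describes the supervision as indirect. The function name `vertex_nll`, the assumption that the network emits one logit per region for each vertex, and the toy inputs are all hypothetical.

```python
import numpy as np

def vertex_nll(logits, labels):
    """Mean negative log-likelihood of per-vertex region labels.

    logits: (V, R) unnormalized scores, one row per vertex (assumed layout).
    labels: (V,) integer region index for each vertex.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Numerically stable log-softmax over the region axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the labeled region for each vertex.
    picked = log_probs[np.arange(labels.shape[0]), labels]
    return float(-picked.mean())

# Uninformative predictions over 4 regions give a loss of log(4).
loss_uniform = vertex_nll(np.zeros((3, 4)), [0, 1, 2])

# Confident, correct predictions drive the loss toward zero.
loss_confident = vertex_nll(20.0 * np.eye(3), [0, 1, 2])
```

Under these assumptions, the loss is lowest when each vertex assigns most of its probability mass to its labeled face region.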

Paper Structure

This paper contains 14 sections, 9 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 2: Method overview. Illustration of the data flow at inference (a) and training (b). For simplicity, the encoders are omitted from (b) and the dotted red box indicates the losses that are exclusively applied to the ICT data.
  • Figure 3: Illustration of the encoder (a) and decoder (b) architectures. S.B. indicates the skinning block.
  • Figure 4: Overlapping regions between blendshapes. The ICT blendshapes for 'EyeSquintRight', 'EyeBlinkRight', 'EyeLookUpRight', and 'BrowOuterUpRight' (a) and the corresponding deformed regions (b).
  • Figure 5: Visual comparison of expression quality produced by our method and comparative methods. The MSE between the GT and the predicted face mesh is colored using a yellow-orange-red color map (YlOrRd).
  • Figure 6: Visual comparison of inverse rigging quality produced by our method and comparative methods. The expression codes were predicted from the source face mesh and were used to reconstruct a face mesh using the ICT blendshape. The incurred deformation on the face is colored using a yellow-orange-red color map (YlOrRd).
  • ...and 4 more figures