Neural Face Skinning for Mesh-agnostic Facial Expression Cloning
Sihun Cha, Serin Yoon, Kwanggyoon Seo, Junyong Noh
TL;DR
The paper tackles mesh-agnostic facial expression retargeting with interpretable control. It introduces a three-encoder, one-decoder network in which a skinning encoder predicts per-vertex region weights $\omega_{Skin}$. These weights localize the influence of a global expression code $z_{GE}$ on each vertex, giving the localized code $z_{LE} = \omega_{Skin} \odot z_{GE}$. The skinning weights are supervised with segmentation labels ($L_{nll}$), and the latent code is aligned with FACS-based ICT blendshapes via $L_{BP}$ and $L_{BR}$, so the model produces accurate, region-specific deformations while remaining intuitive to edit. Trained on ICT-Facekit and Multiface, it outperforms baselines in expression fidelity and inverse rigging, and it extends to unseen stylized meshes, which supports the claim of mesh-agnostic retargeting through local-global fusion. The approach offers a practical path toward controllable, high-fidelity facial animation across diverse geometries.
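The localization step $z_{LE} = \omega_{Skin} \odot z_{GE}$ can be illustrated with a minimal NumPy sketch. The tensor shapes are an assumption made for illustration, not taken from the paper: the skinning weights are treated as an $N \times D$ array (one row per vertex, one column per latent dimension) and the global code as a length-$D$ vector broadcast across vertices. The function name `localize_global_code` and the toy values are hypothetical.

```python
import numpy as np


def localize_global_code(skin_weights, z_ge):
    """Element-wise localization z_LE = w_Skin * z_GE.

    Assumed shapes (illustrative, not confirmed by the source):
      skin_weights: (N, D) per-vertex weights
      z_ge:         (D,)   global expression code shared by all vertices
    Returns an (N, D) array holding one localized code per vertex.
    """
    skin_weights = np.asarray(skin_weights, dtype=float)
    z_ge = np.asarray(z_ge, dtype=float)
    if skin_weights.ndim != 2 or z_ge.ndim != 1:
        raise ValueError("expected skin_weights of shape (N, D) and z_ge of shape (D,)")
    if skin_weights.shape[1] != z_ge.shape[0]:
        raise ValueError("latent dimensions of skin_weights and z_ge must match")
    # Broadcasting multiplies every vertex row by the same global code.
    return skin_weights * z_ge


# Toy example: 4 vertices, 3 latent dimensions (values are made up).
skin_weights = np.array([
    [1.0, 0.0, 0.0],  # vertex influenced only by latent dimension 0
    [0.0, 1.0, 0.0],  # vertex influenced only by latent dimension 1
    [0.5, 0.5, 0.0],  # vertex on a boundary between two regions
    [0.0, 0.0, 0.0],  # vertex that no latent dimension moves
])
z_ge = np.array([0.8, -0.3, 0.5])

z_le = localize_global_code(skin_weights, z_ge)
```

In this toy setup, a vertex with zero weight on a latent dimension ignores that component of the global code, which is how a single code can drive region-specific deformation.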
Abstract
Accurately retargeting facial expressions to a face mesh while enabling manipulation is a key challenge in facial animation retargeting. Recent deep-learning methods address this by encoding facial expressions into a global latent code, but they often fail to capture fine-grained details in local regions. While some methods improve local accuracy by transferring deformations locally, this often complicates overall control of the facial expression. To address this, we propose a method that combines the strengths of both global and local deformation models. Our approach enables intuitive control and detailed expression cloning across diverse face meshes, regardless of their underlying structures. The core idea is to localize the influence of the global latent code on the target mesh. Our model learns to predict skinning weights for each vertex of the target face mesh through indirect supervision from predefined segmentation labels. These predicted weights localize the global latent code, enabling precise and region-specific deformations even for meshes with unseen shapes. We supervise the latent code using Facial Action Coding System (FACS)-based blendshapes to ensure interpretability and allow straightforward editing of the generated animation. Through extensive experiments, we demonstrate improved performance over state-of-the-art methods in terms of expression fidelity, deformation transfer accuracy, and adaptability across diverse mesh structures.
