RigAnyFace: Scaling Neural Facial Mesh Auto-Rigging with Unlabeled Data

Wenchao Ma, Dario Kneubuehler, Maurice Chu, Ian Sachs, Haomiao Jiang, Sharon Xiaolei Huang

TL;DR

RigAnyFace tackles auto-rigging of facial meshes with diverse topologies, including multiple disconnected components, by deforming a neutral mesh into FACS poses using a triangulation-agnostic DiffusionNet backbone conditioned on FACS parameters. A global encoder and a carefully designed 2D supervision pipeline, which leverages 2D appearance and motion signals from differentiable rendering and a MegActor-based animation model, enable scalable training on unlabeled data alongside a smaller set of artist-rigged ground truth. The method achieves state-of-the-art accuracy and generalization, handling in-the-wild heads and complex components such as eyeballs, teeth, and gums, while enabling downstream applications like user-controlled animation, video-to-mesh retargeting, and text-to-3D rigging. This work lowers the barrier to high-quality facial rigs and broadens expressive avatar creation, with remaining limitations on shell-like geometries and extreme discretization artifacts that warrant future study.

Abstract

In this paper, we present RigAnyFace (RAF), a scalable neural auto-rigging framework for facial meshes of diverse topologies, including those with multiple disconnected components. RAF deforms a static neutral facial mesh into industry-standard FACS poses to form an expressive blendshape rig. Deformations are predicted by a triangulation-agnostic surface learning network augmented with our tailored architecture design to condition on FACS parameters and efficiently process disconnected components. For training, we curated a dataset of facial meshes, with a subset meticulously rigged by professional artists to serve as accurate 3D ground truth for deformation supervision. Due to the high cost of manual rigging, this subset is limited in size, constraining the generalization ability of models trained exclusively on it. To address this, we design a 2D supervision strategy for unlabeled neutral meshes without rigs. This strategy increases data diversity and allows for scaled training, thereby enhancing the generalization ability of models trained on this augmented data. Extensive experiments demonstrate that RAF is able to rig meshes of diverse topologies on not only our artist-crafted assets but also in-the-wild samples, outperforming previous works in accuracy and generalizability. Moreover, our method advances beyond prior work by supporting multiple disconnected components, such as eyeballs, for more detailed expression animation. Project page: https://wenchao-m.github.io/RigAnyFace.github.io
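To make the notion of a blendshape rig concrete, the sketch below evaluates a standard linear blendshape model: a posed mesh is the neutral mesh plus a weighted sum of per-pose vertex displacements. This is a generic illustration rather than RAF's implementation; the function name `apply_blendshapes`, the data layout, and the toy values are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def apply_blendshapes(
    neutral: Sequence[Vec3],
    deltas: Sequence[Sequence[Vec3]],
    weights: Sequence[float],
) -> List[Vec3]:
    """Return neutral + sum_k weights[k] * deltas[k], evaluated per vertex.

    neutral: vertex positions of the neutral mesh.
    deltas:  one list of per-vertex displacements for each pose
             (hypothetical layout; one entry per FACS pose).
    weights: one activation weight per pose.
    """
    if len(deltas) != len(weights):
        raise ValueError("need one weight per blendshape")
    posed = [list(v) for v in neutral]
    for delta, w in zip(deltas, weights):
        if len(delta) != len(neutral):
            raise ValueError("each blendshape needs one offset per vertex")
        for i, offset in enumerate(delta):
            for axis in range(3):
                posed[i][axis] += w * offset[axis]
    return [(v[0], v[1], v[2]) for v in posed]


# Toy example: two vertices, two poses.
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
deltas = [
    [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0)],  # pose 0 moves vertex 0 along y
    [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)],  # pose 1 moves vertex 1 along z
]
posed = apply_blendshapes(neutral, deltas, [0.5, 1.0])
print(posed)  # [(0.0, 0.5, 0.0), (1.0, 0.0, 2.0)]
```

Because the displacements are stored per vertex, this evaluation does not depend on mesh connectivity, so disconnected components such as eyeballs can share one vertex list.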

Paper Structure

This paper contains 25 sections, 3 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: We present RigAnyFace (RAF), an auto-rigging framework that supports facial meshes of diverse topologies with multiple disconnected components such as eyeballs. These meshes are drawn from diverse sources and cover both humanoid and non-humanoid heads. Given only a neutral facial mesh and explicitly controllable FACS parameters specifying activated action units, RAF accurately deforms the input mesh into corresponding FACS poses, creating an expressive blendshape rig.
  • Figure 2: (a) Illustration of our artist-crafted facial mesh dataset. (i) Neutral head meshes from our dataset, each consisting of multiple disconnected components. (ii) A subset of neutral head meshes is meticulously annotated with blendshape rigs by professional artists. (iii) To augment the dataset, we develop a head interpolation strategy based on standardized UV layouts. (b) 2D Supervision Generation Pipeline: Given a posed image rendered from a rigged head and a neutral image from an unrigged head, the 2D animation model generates an image that replicates the expression in the posed image while preserving the identity of the neutral image. A flow estimation model is then applied to the neutral and generated posed images to predict the pixel offsets as 2D displacement.
  • Figure 3: Model Architecture. (a) Given a neutral facial mesh, our deformation model predicts the 3D displacement needed to deform the mesh into different expressions based on the input FACS vector. During training, 2D supervision is utilized for both rigged and unrigged heads, while 3D supervision is exclusively applied to rigged heads. (b) We modify the original diffusion block in DiffusionNet to support the FACS vector as an additional conditional input (left). Additionally, we design a global encoder that processes vertex positions and normals of the neutral facial mesh to capture holistic information across disconnected components (right).
  • Figure 3: Ablation on the global encoder.
  • Figure 4: Illustration of our 2D displacement supervision (d), which provides denser feedback for the subtle pose differences between (a) and (b) than the appearance-level supervision (c). Subfigure (c) visualizes per-pixel color-difference magnitudes between (a) and (b), whereas subfigure (d) shows the corresponding pixel offsets using the standard optical-flow color map.
  • ...and 9 more figures
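
The 2D displacement supervision described in the captions of Figures 2(b) and 4 can be illustrated with a small sketch. It is a simplification under stated assumptions, not the paper's pipeline: a per-vertex pinhole projection stands in for differentiable rendering, the target offsets are assumed to come from some flow estimator, and all names (`project`, `displacement_2d`, `l1_displacement_loss`) and camera parameters are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec2 = Tuple[float, float]
Vec3 = Tuple[float, float, float]


def project(p: Vec3, focal: float, cx: float, cy: float) -> Vec2:
    """Pinhole projection of a camera-space point to pixel coordinates."""
    x, y, z = p
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal * x / z + cx, focal * y / z + cy)


def displacement_2d(
    neutral: Sequence[Vec3],
    offsets: Sequence[Vec3],
    focal: float,
    cx: float,
    cy: float,
) -> List[Vec2]:
    """Pixel offset of each vertex when moved by its predicted 3D offset."""
    result = []
    for p, d in zip(neutral, offsets):
        u0, v0 = project(p, focal, cx, cy)
        moved = (p[0] + d[0], p[1] + d[1], p[2] + d[2])
        u1, v1 = project(moved, focal, cx, cy)
        result.append((u1 - u0, v1 - v0))
    return result


def l1_displacement_loss(pred: Sequence[Vec2], target: Sequence[Vec2]) -> float:
    """Mean per-vertex L1 distance between predicted and target pixel offsets."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    total = sum(
        abs(pu - tu) + abs(pv - tv) for (pu, pv), (tu, tv) in zip(pred, target)
    )
    return total / len(pred)


# Toy example: one vertex two units in front of the camera, moved along x.
pred = displacement_2d([(0.0, 0.0, 2.0)], [(0.5, 0.0, 0.0)], 100.0, 50.0, 50.0)
print(pred)                                       # [(25.0, 0.0)]
print(l1_displacement_loss(pred, [(24.0, 1.0)]))  # 2.0
```

A loss of this kind gives a non-zero signal wherever geometry moves, even if pixel colors barely change, which matches the caption's point that displacement supervision is denser than appearance-level supervision for subtle pose differences.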