High-Quality Mesh Blendshape Generation from Face Videos via Neural Inverse Rendering

Xin Ming, Jiawei Li, Jingwang Ling, Libo Zhang, Feng Xu

TL;DR

The paper reconstructs animation-ready, person-specific mesh blendshape rigs from RGB face videos by combining neural inverse rendering with a topology-aware mesh deformation representation. It introduces a vertex deformation parameterization based on differential coordinates with tetrahedral connections, a set of semantic regularization terms for blendshape updates, and a neural regressor that implicitly synchronizes unsynchronized sparse multi-view videos, all integrated with a mesh-based neural deferred renderer. The method jointly optimizes the facial rig (neutral face and blendshapes) together with the head poses and expression coefficients. Evaluated on the Multiface and NeRSemble datasets, it outperforms several baselines in point-to-plane geometry error. The resulting rigs can be used directly in industrial animation pipelines (e.g., Blender) and support applications such as expression retargeting and novel-view synthesis. Code and data are publicly available.

Abstract

Readily editable mesh blendshapes have been widely used in animation pipelines, while recent advancements in neural geometry and appearance representations have enabled high-quality inverse rendering. Building upon these observations, we introduce a novel technique that reconstructs mesh-based blendshape rigs from single or sparse multi-view videos, leveraging state-of-the-art neural inverse rendering. We begin by constructing a deformation representation that parameterizes vertex displacements into differential coordinates with tetrahedral connections, allowing for high-quality vertex deformation on high-resolution meshes. By constructing a set of semantic regulations in this representation, we achieve joint optimization of blendshapes and expression coefficients. Furthermore, to enable a user-friendly multi-view setup with unsynchronized cameras, we propose a neural regressor to model time-varying motion parameters. This approach implicitly considers the time difference across multiple cameras, enhancing the accuracy of motion modeling. Experiments demonstrate that, with the flexible input of single or sparse multi-view videos, we reconstruct personalized high-fidelity blendshapes. These blendshapes are both geometrically and semantically accurate, and they are compatible with industrial animation pipelines. Code and data are available at https://github.com/grignarder/high-quality-blendshape-generation.
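
The two ideas the abstract combines, a blendshape rig and a differential-coordinate parameterization of vertex displacements, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a standard linear blendshape model (neutral mesh plus coefficient-weighted offsets, followed by a rigid head pose) and a simple uniform graph Laplacian form of differential coordinates, u = (I + lam * L) x, built on the edges of a single toy tetrahedron. The function names, the value of `lam`, and the toy data are all hypothetical, and the paper's actual parameterization with tetrahedral connections may differ.

```python
import numpy as np

def graph_laplacian(num_vertices, edges):
    """Uniform graph Laplacian L = D - A from an undirected edge list."""
    L = np.zeros((num_vertices, num_vertices))
    for i, j in edges:
        L[i, j] -= 1.0
        L[j, i] -= 1.0
        L[i, i] += 1.0
        L[j, j] += 1.0
    return L

def to_differential(x, L, lam=10.0):
    """Map per-vertex displacements x (V, 3) to differential coordinates u (V, 3)."""
    return (np.eye(L.shape[0]) + lam * L) @ x

def from_differential(u, L, lam=10.0):
    """Recover displacements from differential coordinates with a linear solve."""
    return np.linalg.solve(np.eye(L.shape[0]) + lam * L, u)

def evaluate_rig(neutral, offsets, beta, R, t):
    """Linear blendshape rig: pose applied to neutral + sum_k beta[k] * offsets[k]."""
    shape = neutral + np.tensordot(beta, offsets, axes=1)  # (V, 3)
    return shape @ R.T + t

# Toy data (assumption): one tetrahedron, i.e. 4 vertices and 6 edges, 2 blendshapes.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
L = graph_laplacian(4, edges)
rng = np.random.default_rng(0)
neutral = rng.normal(size=(4, 3))

# In this sketch the optimized variables live in differential space.
u = rng.normal(size=(2, 4, 3))
offsets = np.stack([from_differential(u_k, L) for u_k in u])  # (2, 4, 3)

beta = np.array([0.3, 0.7])      # expression coefficients
R, t = np.eye(3), np.zeros(3)    # identity head pose
verts = evaluate_rig(neutral, offsets, beta, R, t)
```

Because the matrix I + lam * L is symmetric positive definite, the solve in `from_differential` is always well posed. A gradient step taken on `u` is spread over neighboring vertices when mapped back to displacements, which is the usual motivation for optimizing in a differential space rather than on raw vertex positions.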

Paper Structure (16 sections, 12 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: With the input of sparse multi-view face videos (shown on the left), our technique reconstructs personalized mesh-based blendshapes (examples shown on the right) that are ready to be used in the industrial animation pipeline.
  • Figure 2: Method pipeline. We model the human head as a person-specific facial rig that includes a neutral face and a set of blendshapes. This rig is derived from template blendshapes through tetrahedralizing and reparameterizing per-vertex deformation. The head poses $R$, $\boldsymbol{t}$ and expression coefficients $\boldsymbol{\beta}_{exp}$ are regressed from the timestamps corresponding to each frame by a neural synchronization regressor, which achieves implicit synchronization between the multi-view, not fully synchronized videos. Combined with the facial rig, the dynamic face geometry is obtained. Afterwards, a neural rendering MLP renders the corresponding images according to the latent codes, normals, and view directions acquired through differentiable rasterization. Finally, we leverage the rendering loss, landmark loss, and rigging regularization terms to jointly optimize the facial rig, the neural regressor, and the neural rendering MLP.
  • Figure 3: Visualization of the point-to-plane error heatmaps for PointAvatar, NHA, FLARE, HRN, and our method.
  • Figure 4: Comparisons of identity and expression-related facial details between our method and other baselines.
  • Figure 5: Evaluating the effectiveness of the blendshape deformation representation, including differential coordinate reparameterization and tetrahedral connections.
  • ...and 4 more figures
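
The neural synchronization regressor described in the Figure 2 caption maps a frame timestamp to head pose and expression coefficients. The following is a minimal sketch of that idea, not the authors' network. It assumes a sinusoidal time encoding, a randomly initialized two-layer MLP, an axis-angle rotation output, and a sigmoid that keeps expression coefficients in (0, 1). All names, layer sizes, and the 30 fps / 10 ms camera offset are hypothetical choices for illustration.

```python
import numpy as np

def encode_time(t, num_freqs=4):
    """Sinusoidal encoding of timestamps t (N,) -> (N, 2 * num_freqs)."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def init_regressor(rng, num_freqs=4, hidden=32, num_exp=5):
    """Random weights for a two-layer MLP (hypothetical sizes)."""
    in_dim, out_dim = 2 * num_freqs, 3 + 3 + num_exp
    return {
        "W1": rng.normal(scale=0.1, size=(in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(scale=0.1, size=(hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }

def regress_motion(params, t, num_freqs=4):
    """Timestamps (N,) -> axis-angle rotation (N, 3), translation (N, 3), coefficients (N, K)."""
    h = np.tanh(encode_time(t, num_freqs) @ params["W1"] + params["b1"])
    out = h @ params["W2"] + params["b2"]
    rot, trans = out[:, :3], out[:, 3:6]
    beta = 1.0 / (1.0 + np.exp(-out[:, 6:]))  # sigmoid keeps coefficients in (0, 1)
    return rot, trans, beta

rng = np.random.default_rng(0)
params = init_regressor(rng)

# Two cameras at 30 fps whose shutters are offset by 10 ms (assumed values).
cam_a_times = np.arange(5) / 30.0
cam_b_times = cam_a_times + 0.010

# Each camera queries the same continuous function at its own timestamps,
# so no frame-level alignment between the two videos is required.
rot_a, trans_a, beta_a = regress_motion(params, cam_a_times)
rot_b, trans_b, beta_b = regress_motion(params, cam_b_times)
```

Because motion is a continuous function of time in this sketch, frames from different cameras never need to be paired; the time difference between cameras is absorbed by evaluating the regressor at each frame's own timestamp.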