
Near-realtime Facial Animation by Deep 3D Simulation Super-Resolution

Hyojoon Park, Sangeetha Grama Srinivasan, Matthew Cong, Doyub Kim, Byungsoo Kim, Jonathan Swartz, Ken Museth, Eftychios Sifakis

TL;DR

The paper tackles the challenge of producing high-fidelity facial animation at near-realtime rates by coupling a fast, low-resolution (LR) physics-based face simulator with a neural network that upsamples its output to high-resolution (HR) detail. Training uses semantically aligned frame pairs, generated by applying the same muscle activations and bone poses to the LR and HR simulators, so the model learns to compensate for cross-resolution discrepancies and generalizes to unseen expressions and even dynamic effects. The proposed three-component network (Feature Encoding, Coordinate-based Upsampling, and Surface Reconstruction), trained with a composite loss, achieves near-realtime end-to-end performance (~18.5 FPS) while producing HR surfaces that closely match offline HR simulations. The framework is evaluated through qualitative and quantitative experiments, ablations, and tests with blendshape inputs, which indicate robust generalization and efficient inference for real-time, physically informed facial animation.

Abstract

We present a neural network-based simulation super-resolution framework that can efficiently and realistically enhance a facial performance produced by a low-cost, realtime physics-based simulation to a level of detail that closely approximates that of a reference-quality off-line simulator with much higher resolution (26x element count in our examples) and accurate physical modeling. Our approach is rooted in our ability to construct - via simulation - a training set of paired frames, from the low- and high-resolution simulators respectively, that are in semantic correspondence with each other. We use face animation as an exemplar of such a simulation domain, where creating this semantic congruence is achieved by simply dialing in the same muscle actuation controls and skeletal pose in the two simulators. Our proposed neural network super-resolution framework generalizes from this training set to unseen expressions, compensates for modeling discrepancies between the two simulations due to limited resolution or cost-cutting approximations in the real-time variant, and does not require any semantic descriptors or parameters to be provided as input, other than the result of the real-time simulation. We evaluate the efficacy of our pipeline on a variety of expressive performances and provide comparisons and ablation experiments for plausible variations and alternatives to our proposed scheme.
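
As a quick sanity check on the figures quoted in the abstract and in the figure captions, the short Python sketch below recomputes the element-count ratio from the tetrahedra counts in the Figure 4 caption and estimates how much of each frame is spent outside the simulator, using the frame rates in the Figure 3 caption. The helper name `residual_ms` is hypothetical, and the estimate assumes that simulation and the remaining stages run strictly one after the other on each frame, which the text here does not state.

```python
# Values quoted from the figure captions (tetrahedra counts are rounded there).
HR_TETS = 1.9e6   # high-resolution simulation mesh, tetrahedra
LR_TETS = 73e3    # low-resolution simulation mesh, tetrahedra


def residual_ms(sim_fps: float, end_to_end_fps: float) -> float:
    """Per-frame time (ms) not spent in simulation.

    Assumption (not from the source): the stages run sequentially, so
    end-to-end frame time = simulation time + everything else.
    """
    return (1.0 / end_to_end_fps - 1.0 / sim_fps) * 1000.0


element_ratio = HR_TETS / LR_TETS
near_realtime_ms = residual_ms(sim_fps=30.06, end_to_end_fps=18.46)
realtime_ms = residual_ms(sim_fps=67.79, end_to_end_fps=28.04)

print(f"HR/LR element ratio:           {element_ratio:.1f}x")
print(f"non-simulation time (LR):      {near_realtime_ms:.1f} ms")
print(f"non-simulation time (coarser): {realtime_ms:.1f} ms")
```

Under that sequential assumption, both configurations leave roughly 21 ms per frame for everything other than simulation, and the ratio comes out at about 26, in line with the "26x element count" stated in the abstract.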

Paper Structure

This paper contains 64 sections, 16 equations, 20 figures, 5 tables.

Figures (20)

  • Figure 1: The overview of our pipeline for 3D simulation super-resolution aiming at learning a mapping from a low-resolution (LR) volumetric mesh to a high-resolution (HR) surface mesh. Our pipeline is comprised of (1) Feature Encoding, (2) Coordinate-based Upsampling, and (3) Surface Reconstruction modules. The input and output are sets of 3D displacement vectors from the LR and HR rest pose shapes, respectively. © NVIDIA
  • Figure 2: Illustration of finding the $k$ nearest vertices $\{\mathbf{x}^L_1, \ldots, \mathbf{x}^L_k\}$ (where $1, \ldots, k \in \mathcal{N}_j$) on the LR mesh to the vertex $\mathbf{x}^H_j$ on the HR mesh using geodesic distances. © NVIDIA
  • Figure 3: (a) High-resolution surface model with dimensions of $289.0 \times 342.7 \times 291.1$ [mm] along the $x$, $y$, and $z$ axes, respectively, including part of the shoulders, (b) high-resolution simulation model (0.16 FPS simulation), (c) low-resolution simulation model (30.06 FPS simulation) for the near-realtime end-to-end animation at 18.46 FPS, and (d) coarser low-resolution simulation model (67.79 FPS simulation) for the true real-time end-to-end animation at 28.04 FPS. © NVIDIA
  • Figure 4: The face surface embedded in the non-conforming low-resolution volumetric mesh with 73 thousand tetrahedra (left) deviates significantly from the same surface simulated using a conforming high-resolution mesh with 1.9 million tetrahedra (right), even though both deformations are parameterized using the same blend shape weights and jaw transformation. We zoom into different regions of the face to highlight macro and microscopic discrepancies. © NVIDIA
  • Figure 5: Frame-wise mean surface reconstruction error on unseen facial expressions for each tested model. Our method (red line) achieved the lowest mean error on every test frame.
  • ...and 15 more figures
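
The neighbor search described in the Figure 2 caption can be illustrated with a minimal Python sketch. The caption does not say how geodesic distances are computed, so this sketch swaps in a common approximation: shortest paths along mesh edges found with Dijkstra's algorithm. The function names `build_edge_graph` and `k_nearest_geodesic` are hypothetical, and snapping the HR query vertex to its Euclidean-nearest LR vertex before the search is an assumption made only to keep the example short.

```python
import heapq
import math


def build_edge_graph(vertices, triangles):
    """Adjacency map {vertex: {neighbor: edge length}} from triangle edges."""
    graph = {i: {} for i in range(len(vertices))}
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            length = math.dist(vertices[u], vertices[v])
            graph[u][v] = length
            graph[v][u] = length
    return graph


def k_nearest_geodesic(graph, vertices, query, k):
    """Return [(lr_index, distance)] for the k LR vertices closest to `query`.

    Distances are edge-path lengths (an approximation of geodesic distance).
    Assumption: the query is first snapped to its Euclidean-nearest LR vertex,
    and that snap distance is added to every path length.
    """
    seed = min(range(len(vertices)), key=lambda i: math.dist(vertices[i], query))
    start = math.dist(vertices[seed], query)
    best = {seed: start}
    heap = [(start, seed)]
    settled, found = set(), []
    while heap and len(found) < k:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry
        settled.add(u)
        found.append((u, d))
        for v, length in graph[u].items():
            nd = d + length
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return found


# Toy LR mesh: a ribbon folded into a U, so its two ends are close in space
# (0.2 apart) but far apart along the surface.
lr_vertices = [
    (0.0, 0.0, 0.0), (0.0, 0.0, 1.0),
    (2.0, 0.0, 0.0), (2.0, 0.0, 1.0),
    (2.0, 0.2, 0.0), (2.0, 0.2, 1.0),
    (0.0, 0.2, 0.0), (0.0, 0.2, 1.0),
]
lr_triangles = [
    (0, 2, 1), (1, 2, 3),
    (2, 4, 3), (3, 4, 5),
    (4, 6, 5), (5, 6, 7),
]
graph = build_edge_graph(lr_vertices, lr_triangles)
hr_vertex = (0.0, -0.05, 0.0)  # hypothetical HR vertex next to LR vertex 0
neighbours = k_nearest_geodesic(graph, lr_vertices, hr_vertex, k=3)
print([index for index, _ in neighbours])
```

In this toy mesh, a Euclidean search would pick vertex 6 on the opposite arm of the U as the second-closest neighbor, whereas the edge-path search stays on the same arm. That is the kind of situation, for example the upper and lower lips, where surface-based distances are a natural choice. A production implementation would likely use a more accurate geodesic solver and a more careful treatment of the cross-mesh correspondence.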