Near-realtime Facial Animation by Deep 3D Simulation Super-Resolution
Hyojoon Park, Sangeetha Grama Srinivasan, Matthew Cong, Doyub Kim, Byungsoo Kim, Jonathan Swartz, Ken Museth, Eftychios Sifakis
TL;DR
The paper targets high-fidelity facial animation at near-realtime rates by coupling a fast, low-resolution (LR) physics-based face simulator with a neural network that upsamples its output to high-resolution (HR) detail. The network is trained on semantically aligned frame pairs, generated by applying the same muscle activations and bone poses to the LR and HR simulators, which lets it compensate for cross-resolution discrepancies and generalize to unseen expressions and even dynamic effects. The proposed three-component network (Feature Encoding, Coordinate-based Upsampling, and Surface Reconstruction), trained with a composite loss, achieves near-realtime end-to-end performance (~18.5 FPS) while producing HR surfaces that closely match offline HR simulations. Qualitative and quantitative experiments, ablations, and tests with blendshape inputs indicate robust generalization and efficient inference, supporting the framework's practical value for real-time, physically informed facial animation.
Abstract
We present a neural network-based simulation super-resolution framework that can efficiently and realistically enhance a facial performance produced by a low-cost, realtime physics-based simulation to a level of detail that closely approximates that of a reference-quality off-line simulator with much higher resolution (26x element count in our examples) and accurate physical modeling. Our approach is rooted in our ability to construct, via simulation, a training set of paired frames, from the low- and high-resolution simulators respectively, that are in semantic correspondence with each other. We use face animation as an exemplar of such a simulation domain, where creating this semantic congruence is achieved by simply dialing in the same muscle actuation controls and skeletal pose in the two simulators. Our proposed neural network super-resolution framework generalizes from this training set to unseen expressions, compensates for modeling discrepancies between the two simulations due to limited resolution or cost-cutting approximations in the real-time variant, and does not require any semantic descriptors or parameters to be provided as input, other than the result of the real-time simulation. We evaluate the efficacy of our pipeline on a variety of expressive performances and provide comparisons and ablation experiments for plausible variations and alternatives to our proposed scheme.
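
The paired-frame construction described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_simulator`, `sample_controls`, and `build_paired_dataset` are hypothetical names, and the toy simulators are trivial stand-ins for the paper's physics-based face simulators, not a reproduction of them. The only point shown is that each training pair comes from feeding identical controls to both resolutions, and that the controls themselves are not stored with the pair.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vertex = Tuple[float, float, float]
Controls = Tuple[List[float], List[float]]  # (muscle activations, bone pose)
Simulator = Callable[[Sequence[float], Sequence[float]], List[Vertex]]


def toy_simulator(num_vertices: int) -> Simulator:
    """Hypothetical stand-in for a face simulator at a given resolution."""
    def simulate(activations: Sequence[float], pose: Sequence[float]) -> List[Vertex]:
        # Toy deformation: every vertex is displaced by the summed controls.
        drive = sum(activations) + sum(pose)
        return [(i / num_vertices, drive * i / num_vertices, 0.0)
                for i in range(num_vertices)]
    return simulate


def sample_controls(rng: random.Random, num_muscles: int, pose_dim: int) -> Controls:
    """Draw one random set of controls (assumed ranges, for illustration only)."""
    activations = [rng.random() for _ in range(num_muscles)]
    pose = [rng.uniform(-1.0, 1.0) for _ in range(pose_dim)]
    return activations, pose


def build_paired_dataset(
    lr_sim: Simulator, hr_sim: Simulator, controls_list: Sequence[Controls]
) -> List[Tuple[List[Vertex], List[Vertex]]]:
    """Run both simulators on the same controls to get aligned (LR, HR) frames."""
    pairs = []
    for activations, pose in controls_list:
        lr_frame = lr_sim(activations, pose)  # network input
        hr_frame = hr_sim(activations, pose)  # training target
        pairs.append((lr_frame, hr_frame))    # controls are deliberately not kept
    return pairs


rng = random.Random(0)
controls = [sample_controls(rng, num_muscles=5, pose_dim=3) for _ in range(4)]
pairs = build_paired_dataset(toy_simulator(8), toy_simulator(64), controls)
```

Dropping the controls from each pair mirrors the abstract's statement that the network receives only the result of the real-time simulation, with no semantic descriptors or parameters as input.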
