Rig3DGS: Creating Controllable Portraits from Casual Monocular Videos

Alfredo Rivero; ShahRukh Athar; Zhixin Shu; Dimitris Samaras

TL;DR

Rig3DGS addresses controllable 3D portrait rendering from casual monocular videos by representing the scene as a canonical collection of 3D Gaussian splats and deforming them with a learnable prior derived from a 3D morphable model. The core idea is to restrict each per-Gaussian deformation to the subspace spanned by the deformations of nearby FLAME vertices, which enables stable reanimation with arbitrary expressions and head poses while supporting novel-view synthesis of the entire scene. The method achieves higher fidelity and faster training and rendering than prior baselines such as RigNeRF, INSTA, and PointAvatar, and ablations show that the learnable prior is essential. While effective for photorealistic reanimation, the approach assumes controlled illumination and a relatively still capture, which points to lighting robustness and dynamic motion as directions for future work.
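The subspace constraint described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `deform_gaussians`, the neighbor count `k`, and the use of a softmax over per-Gaussian `logits` as the learnable weights are all assumptions made for illustration, since this summary does not give the exact parameterization of the prior. The sketch only shows the general idea that each Gaussian moves by a weighted combination of the displacements of its nearest mesh vertices.

```python
import numpy as np

def deform_gaussians(centers, verts_canonical, verts_deformed, logits, k=4):
    """Move Gaussian centers by a convex combination of nearby vertex offsets.

    Hypothetical illustration: `logits` (N, k) stand in for learnable
    parameters; the paper's actual prior may be parameterized differently.
    """
    # Displacement of every mesh vertex from canonical to deformed space.
    vertex_offsets = verts_deformed - verts_canonical                # (V, 3)

    # Squared distance from each Gaussian center to each canonical vertex.
    diff = centers[:, None, :] - verts_canonical[None, :, :]         # (N, V, 3)
    dist2 = (diff ** 2).sum(axis=-1)                                 # (N, V)

    # Indices of the k nearest canonical vertices for every Gaussian.
    nearest = np.argsort(dist2, axis=1)[:, :k]                       # (N, k)

    # Numerically stable softmax turns the logits into weights that sum to 1.
    shifted = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=1, keepdims=True)                    # (N, k)

    # Each Gaussian's offset lies in the span of its neighbors' offsets.
    offsets = (weights[..., None] * vertex_offsets[nearest]).sum(axis=1)
    return centers + offsets

# Usage: a rigid translation of the mesh moves every Gaussian by the same amount.
rng = np.random.default_rng(0)
verts = rng.normal(size=(50, 3))
centers = rng.normal(size=(10, 3))
logits = rng.normal(size=(10, 4))
shift = np.array([0.1, -0.2, 0.3])
moved = deform_gaussians(centers, verts, verts + shift, logits, k=4)
```

Because the weights sum to one, a pure translation of the mesh carries every Gaussian along with it regardless of the logit values, which is one intuition for why constraining the deformation to follow nearby vertices can stabilize reanimation under unseen poses.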

Abstract

Creating controllable 3D human portraits from casual smartphone videos is highly desirable due to their immense value in AR/VR applications. The recent development of 3D Gaussian Splatting (3DGS) has shown improvements in rendering quality and training efficiency. However, it remains a challenge to accurately model and disentangle head movements and facial expressions from a single-view capture to achieve high-quality renderings. In this paper, we introduce Rig3DGS to address this challenge. We represent the entire scene, including the dynamic subject, using a set of 3D Gaussians in a canonical space. Using a set of control signals, such as head pose and expressions, we transform them to the 3D space with learned deformations to generate the desired rendering. Our key innovation is a carefully designed deformation method which is guided by a learnable prior derived from a 3D morphable model. This approach is highly efficient in training and effective in controlling facial expressions, head positions, and view synthesis across various captures. We demonstrate the effectiveness of our learned deformation through extensive quantitative and qualitative experiments. The project page can be found at http://shahrukhathar.github.io/2024/02/05/Rig3DGS.html

Paper Structure (24 sections, 12 equations, 6 figures, 4 tables)

This paper contains 24 sections, 12 equations, 6 figures, 4 tables.

Introduction
Related Work
Neural Scene Representations and Novel View Synthesis.
Dynamic Neural Scene Representations.
Controllable Face Generation.
Rig3DGS
Preliminaries
3D Gaussian Splatting
Deforming Gaussians with a Learnable Prior
Rotating and Scaling Gaussians
Full Loss
Results
Baseline Approaches
Training Data Capture
Evaluation on Test Data
...and 9 more sections

Figures (6)

Figure 1: Rig3DGS. Our method, Rig3DGS, enables the creation of reanimatable portraits with full control over the subject's facial expressions and head pose, and over the viewing direction of the entire scene they are in. Rig3DGS uses a deformation field based on a learnable prior to ensure photorealistic reanimation and generalization to novel expressions and head poses.
Figure 2: Rig3DGS. Our method models the dynamic scene as a collection of 3D Gaussians in the canonical space that are deformed according to the target facial expression and head-pose to the deformed space before being rendered via differentiable splatting. We constrain the deformation to lie in the sub-space of local vertex deformation, which allows us to generate photorealistic renders with high fidelity to the target expression and head-pose.
Figure 3: Qualitative comparison of Subjects 1-7 in Setting 1. Rig3DGS produces full-scene renders with higher-quality facial and background detail than competing baselines. PointAvatar's results for Subject 6 never converged despite three separate experimental trials.
Figure 4: Ablation of the Learnable Deformation Prior. As can be seen, the learnable prior, as defined by Eq. (\ref{eq:deform_def_eq_final}), is able to model the target expression and head-pose better than the fixed prior (see highlighted regions). The model with no prior fails to reanimate altogether.
Figure 5: Sample renders of Subjects 1, 3, and 5 reanimated by different expression and head pose donors. We refer the reader to our supplementary video material for a more comprehensive evaluation.
...and 1 more figures
