Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture

Haodi He, Jihun Yu, Ronald Fedkiw

TL;DR

The paper presents a pipeline that reconstructs high-fidelity facial geometry and texture from uncalibrated multi-view images using Gaussian Splatting, tightly coupling the Gaussians to a triangulated mesh through soft geometric constraints and semantic segmentation. It transforms the Gaussian Splats into texture space, where they act as a view-dependent neural texture, and uses a relightable Gaussian model with PCA-albedo priors to separate texture from lighting, producing de-lit textures from limited data without a light stage. The method supports training on disparate captures and culminates in MetaHuman generation, yielding an animatable, relightable asset compatible with standard graphics pipelines. Experiments compare against prior work, showing improved geometry alignment and robust de-lighting, and demonstrate a text-driven asset creation pipeline.

Abstract

We leverage increasingly popular three-dimensional neural representations in order to construct a unified and consistent explanation of a collection of uncalibrated images of the human face. Our approach utilizes Gaussian Splatting, since it is more explicit and thus more amenable to constraints than NeRFs. We leverage segmentation annotations to align the semantic regions of the face, facilitating the reconstruction of a neutral pose from only 11 images (as opposed to requiring a long video). We soft constrain the Gaussians to an underlying triangulated surface in order to provide a more structured Gaussian Splat reconstruction, which in turn informs subsequent perturbations to increase the accuracy of the underlying triangulated surface. The resulting triangulated surface can then be used in a standard graphics pipeline. In addition, and perhaps most impactful, we show how accurate geometry enables the Gaussian Splats to be transformed into texture space where they can be treated as a view-dependent neural texture. This allows one to use high visual fidelity Gaussian Splatting on any asset in a scene without the need to modify any other asset or any other aspect (geometry, lighting, renderer, etc.) of the graphics pipeline. We utilize a relightable Gaussian model to disentangle texture from lighting in order to obtain a delit high-resolution albedo texture that is also readily usable in a standard graphics pipeline. The flexibility of our system allows for training with disparate images, even with incompatible lighting, facilitating robust regularization. Finally, we demonstrate the efficacy of our approach by illustrating its use in a text-driven asset creation pipeline.
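The soft constraint that keeps each Gaussian near the triangulated surface can be illustrated with a small sketch. The abstract does not give the exact form of the constraint, so the quadratic hinge penalty below, along with the names `plane_distance`, `soft_surface_penalty`, `tolerance`, and `weight`, is an assumption made purely for illustration and is not the authors' formulation.

```python
import math

def _sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])

def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def plane_distance(p, a, b, c):
    """Signed distance from point p to the plane of triangle (a, b, c)."""
    n = _cross(_sub(b, a), _sub(c, a))
    length = math.sqrt(_dot(n, n))
    if length == 0.0:
        raise ValueError("degenerate triangle")
    return _dot(_sub(p, a), n) / length

def soft_surface_penalty(p, triangle, tolerance=0.0, weight=1.0):
    """Hypothetical soft constraint: zero within `tolerance` of the triangle's
    plane, growing quadratically beyond it (an assumed form, not the paper's)."""
    excess = max(0.0, abs(plane_distance(p, *triangle)) - tolerance)
    return weight * excess ** 2

# Example: a Gaussian center 0.5 units above a triangle in the z = 0 plane.
triangle = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
penalty = soft_surface_penalty((0.2, 0.2, 0.5), triangle, tolerance=0.1, weight=2.0)
```

Because the penalty is added to the loss rather than enforced exactly, a Gaussian may still drift from its triangle when the image evidence is strong enough. A deviation of that kind is the sort of signal the abstract describes using to perturb the triangulated surface afterwards.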

Paper Structure

This paper contains 35 sections, 16 equations, 22 figures.

Figures (22)

  • Figure 1: Using a small number of self-captured uncalibrated multi-view images, we use segmentation annotations along with size and shape constraints to force the Gaussians to move instead of deform. In addition, soft constraints are used to keep the Gaussians tightly coupled and close to the triangulated surface. After training, in a post process, the triangulated surface is deformed to better approximate the Gaussian reconstruction.
  • Figure 2: The ground-truth is well-reconstructed (column 1) not only by our method but also in the ablation tests where segmentation supervision (row 2) and constraints (row 3) have been omitted. Omitting segmentation supervision allows the Gaussians to incorrectly explain regions of the image that their triangles should not be associated with (compare row 2 column 2 to row 1 column 2), resulting in spurious geometry (row 2 column 3). Omitting soft constraints disconnects the Gaussians from their triangles resulting in very spurious geometry (row 3 column 3).
  • Figure 3: The 11 predefined target head poses used for reconstruction.
  • Figure 4: The annotated texture map (left) used to create training data: face (red), nose (blue), nostril (pink), top lip (green), bottom lip (cyan), eyes (yellow), ears (dark red), non-face (white). The labeled image (middle) has an additional (black) background label. Note how the (black) background label has been expanded to portions of the foreground in order to emphasize occlusion boundaries. The segmentation network is trained to recover Fig. 4 (middle) from Fig. 4 (right).
  • Figure 5: The template texture (left) used to assign labels to Gaussians: face (red), nose (blue), nostril (pink), top lip (green), bottom lip (cyan), eyes (yellow), ears (dark red), background (black). Note how the face label (red) has been expanded into both the hair and neck regions, as compared to Fig. 4. Note the differences between the labeled triangles (right) and what one would expect to obtain as image labels from the segmentation network (Fig. 4 middle).
  • ...and 17 more figures
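
The color-coded labels listed for Figure 5 can be read as a lookup table from texture color to semantic region. The sketch below is one hypothetical way to encode that table and classify a texel by its nearest label color. The captions name colors only, so the RGB values, the name `TEMPLATE_LABEL_COLORS`, and the function `label_for_texel` are all assumptions for illustration, not part of the paper.

```python
# Hypothetical RGB values: the caption gives color names, not numeric values.
TEMPLATE_LABEL_COLORS = {
    "face": (255, 0, 0),         # red
    "nose": (0, 0, 255),         # blue
    "nostril": (255, 192, 203),  # pink
    "top_lip": (0, 255, 0),      # green
    "bottom_lip": (0, 255, 255), # cyan
    "eyes": (255, 255, 0),       # yellow
    "ears": (139, 0, 0),         # dark red
    "background": (0, 0, 0),     # black
}

def label_for_texel(rgb):
    """Return the label whose assumed color is nearest to `rgb` (squared Euclidean distance)."""
    def dist2(color):
        return sum((x - y) ** 2 for x, y in zip(rgb, color))
    return min(TEMPLATE_LABEL_COLORS, key=lambda name: dist2(TEMPLATE_LABEL_COLORS[name]))

# Example: a slightly off-red texel still maps to the face region.
region = label_for_texel((250, 5, 5))
```

Nearest-color matching tolerates small deviations introduced by texture filtering or compression, which an exact-match lookup would not.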