VGGTFace: Topologically Consistent Facial Geometry Reconstruction in the Wild

Xin Ming, Yuxuan Han, Tianyu Huang, Feng Xu

TL;DR

This paper addresses the challenge of recovering topologically consistent facial geometry from in-the-wild multi-view images. It leverages VGGT to produce pixel-aligned point maps and camera estimates, then injects topology information via Pixel3DMM UV maps and fuses predictions with a Topology-Aware Bundle Adjustment that adds a local Laplacian regularization. The resulting method, VGGTFace, reconstructs high-quality meshes from 16 views in about 10 seconds on an RTX 4090 and achieves state-of-the-art results on multiple benchmarks while generalizing to real-world data. By integrating 3D foundation models with topology-aware optimization, the work demonstrates a scalable pathway for robust, avatar-grade facial geometry from ordinary captures.

Abstract

Reconstructing topologically consistent facial geometry is crucial for digital avatar creation pipelines. Existing methods either require tedious manual effort, lack generalization to in-the-wild data, or are constrained by the limited expressiveness of 3D Morphable Models. To address these limitations, we propose VGGTFace, an automatic approach that applies a 3D foundation model, i.e., VGGT, to topologically consistent facial geometry reconstruction from in-the-wild multi-view images captured by everyday users. Our key insight is that, by leveraging VGGT, our method naturally inherits strong generalization ability and expressive power from its large-scale training and point map representation. However, it is unclear how to reconstruct a topologically consistent mesh from VGGT, as topology information is missing from its predictions. To this end, we augment VGGT with Pixel3DMM to inject topology information via pixel-aligned UV values. In this manner, we convert the pixel-aligned point map of VGGT into a point cloud with topology. Tailored to this point cloud with known topology, we propose a novel Topology-Aware Bundle Adjustment strategy to fuse the per-view predictions, in which we add a Laplacian energy to the Bundle Adjustment objective. Our method achieves high-quality reconstruction in 10 seconds for 16 views on a single NVIDIA RTX 4090. Experiments demonstrate state-of-the-art results on benchmarks and impressive generalization to in-the-wild data. Code is available at https://github.com/grignarder/vggtface.
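To make the idea of a Laplacian energy on a point cloud with known topology more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a uniform (umbrella) graph Laplacian built from the template's triangle faces and penalizes how far each point's Laplacian coordinate drifts from that of a reference shape. The function names `uniform_laplacian` and `laplacian_energy`, the choice of uniform weights, and the use of a reference shape are all illustrative assumptions; the exact form used in the paper may differ.

```python
import numpy as np


def uniform_laplacian(num_vertices, faces):
    """Dense uniform Laplacian L = I - D^-1 A built from triangle faces.

    Assumption: uniform (umbrella) weights; the paper's weighting may differ.
    """
    adj = np.zeros((num_vertices, num_vertices))
    for a, b, c in faces:
        for i, j in ((a, b), (b, c), (c, a)):
            adj[i, j] = adj[j, i] = 1.0
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0  # avoid division by zero for isolated vertices
    return np.eye(num_vertices) - adj / deg


def laplacian_energy(points, reference, lap):
    """Sum of squared differences between Laplacian coordinates.

    points, reference: (N, 3) arrays sharing the same topology.
    lap: (N, N) Laplacian matrix from uniform_laplacian.
    """
    diff = lap @ points - lap @ reference
    return float((diff ** 2).sum())


if __name__ == "__main__":
    # Hypothetical toy topology: a unit square split into two triangles.
    faces = [(0, 1, 2), (0, 2, 3)]
    reference = np.array(
        [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float
    )
    lap = uniform_laplacian(len(reference), faces)

    bumped = reference.copy()
    bumped[0, 2] = 1.0  # lift one vertex out of the plane
    print(laplacian_energy(reference, reference, lap))
    print(laplacian_energy(bumped, reference, lap))
```

Because every row of this Laplacian sums to zero for connected vertices, the energy ignores a global translation of the points and responds only to local shape changes, which is the property that makes such a term usable as a regularizer alongside a reprojection objective.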

Paper Structure

This paper contains 28 sections, 7 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: VGGTFace. Given in-the-wild multi-view images (4 of 16 images are shown here) captured by everyday users as input, our method can reconstruct a topologically consistent mesh from these inputs in 10 seconds. Our method demonstrates strong generalization ability across lighting conditions of the capture site, facial expressions, and ethnic groups, opening the door for everyday users to scan themselves into the digital world. By innovatively applying the point map as the facial geometry representation, our method can reconstruct person-specific facial traits, e.g. asymmetric expressions, with high fidelity.
  • Figure 2: The framework of VGGTFace. Given $V$ multi-view images $\{I_i\}_{i=1}^V$ as input, we first augment VGGT with Pixel3DMM to obtain the camera parameters $\{K_i,R_i,t_i\}_{i=1}^V$, point map $\{X_i\}_{i=1}^V$, and UV coordinate image $\{C_i\}_{i=1}^V$ for each view. We then convert these raw predictions to a set of tracks $\{q_{ij}\}$ and an initial point cloud $\{p_j\}_{j=1}^N$ according to the template mesh $\mathcal{M}$. Next, we propose a novel Topology-Aware Bundle Adjustment technique to simultaneously optimize the camera parameters $\{K_i,R_i,t_i\}_{i=1}^V$ and the point cloud $\{p_j\}_{j=1}^N$ to better match the tracks. After that, we connect the point cloud with the topology of $\mathcal{M}$ to obtain a mesh with consistent topology.
  • Figure 3: Qualitative comparison on the H3DS benchmark (the first row) and the NeRSemble benchmark (the last two rows). For each result, we pair it with the error map against the GT scan. Methods marked with * adopt the BFM topology with around 40000 vertices, while Ours and Pixel3DMM follow the FLAME topology with around 5000 and 1500 vertices, respectively.
  • Figure 4: Comparisons on in-the-wild data. For each result, we pair it with the error map against the GT scan with the same colorbar as Figure \ref{Fig:cmp}. We report the mean chamfer distance below each method's result.
  • Figure 5: Qualitative evaluation of the key design choices in our method.
  • ...and 7 more figures
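
The Figure 2 caption describes optimizing the cameras $\{K_i,R_i,t_i\}$ and points $\{p_j\}$ so that the points better match the tracks $\{q_{ij}\}$. A minimal sketch of the reprojection part of such an objective is given below. It assumes a standard pinhole model with a world-to-camera convention ($x_{cam} = R p + t$) and a per-view visibility mask; the names `project`, `reprojection_energy`, and `visible` are hypothetical and are not taken from the paper or its code.

```python
import numpy as np


def project(K, R, t, points):
    """Project (N, 3) world points to (N, 2) pixels with a pinhole camera.

    Assumption: world-to-camera convention, x_cam = R @ p + t.
    """
    cam = points @ R.T + t           # points in camera coordinates
    pix = cam @ K.T                  # homogeneous pixel coordinates
    return pix[:, :2] / pix[:, 2:3]  # perspective divide


def reprojection_energy(cameras, points, tracks, visible):
    """Sum of squared pixel residuals over all visible (view, point) pairs.

    cameras: list of (K, R, t) tuples, one per view.
    points:  (N, 3) array of 3D points p_j.
    tracks:  (V, N, 2) array of 2D observations q_ij.
    visible: (V, N) boolean mask, True where point j is observed in view i.
    """
    total = 0.0
    for i, (K, R, t) in enumerate(cameras):
        residual = project(K, R, t, points) - tracks[i]
        total += float((residual[visible[i]] ** 2).sum())
    return total


if __name__ == "__main__":
    # Hypothetical single camera at the origin looking down +z.
    K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
    R, t = np.eye(3), np.zeros(3)
    points = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]])

    tracks = project(K, R, t, points)[None].copy()
    tracks[0, 1, 0] += 3.0  # perturb one observation by 3 pixels
    visible = np.ones((1, 2), dtype=bool)
    print(reprojection_energy([(K, R, t)], points, tracks, visible))
```

In a full bundle adjustment this energy would be minimized jointly over the cameras and the points with a nonlinear least-squares solver; the sketch only evaluates the residual.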