VGGTFace: Topologically Consistent Facial Geometry Reconstruction in the Wild
Xin Ming, Yuxuan Han, Tianyu Huang, Feng Xu
TL;DR
This paper addresses the challenge of recovering topologically consistent facial geometry from in-the-wild multi-view images. It leverages VGGT to produce pixel-aligned point maps and camera estimates, then injects topology information via Pixel3DMM UV maps and fuses predictions with a Topology-Aware Bundle Adjustment that adds a local Laplacian regularization. The resulting method, VGGTFace, reconstructs high-quality meshes from 16 views in about 10 seconds on an RTX 4090 and achieves state-of-the-art results on multiple benchmarks while generalizing to real-world data. By integrating 3D foundation models with topology-aware optimization, the work demonstrates a scalable pathway for robust, avatar-grade facial geometry from ordinary captures.
Abstract
Reconstructing topologically consistent facial geometry is crucial for digital avatar creation pipelines. Existing methods either require tedious manual effort, generalize poorly to in-the-wild data, or are constrained by the limited expressiveness of 3D Morphable Models. To address these limitations, we propose VGGTFace, an automatic approach that applies a 3D foundation model, VGGT, to topologically consistent facial geometry reconstruction from in-the-wild multi-view images captured by everyday users. Our key insight is that, by leveraging VGGT, our method naturally inherits strong generalization ability and expressive power from its large-scale training and point map representation. However, it is unclear how to reconstruct a topologically consistent mesh from VGGT, because its predictions carry no topology information. To this end, we augment VGGT with Pixel3DMM, which injects topology information via pixel-aligned UV values. In this manner, we convert the pixel-aligned point maps of VGGT into point clouds with topology. Tailored to these point clouds with known topology, we propose a novel Topology-Aware Bundle Adjustment strategy to fuse them, in which we construct a Laplacian energy term for the Bundle Adjustment objective. Our method achieves high-quality reconstruction in 10 seconds for 16 views on a single NVIDIA RTX 4090. Experiments demonstrate state-of-the-art results on benchmarks and impressive generalization to in-the-wild data. Code is available at https://github.com/grignarder/vggtface.
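
To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch and not the released implementation. It assumes that each pixel carries a 3D point and a UV value, that pixels are assigned to the nearest template vertex in UV space, and that the Laplacian energy is a uniform graph Laplacian. The function names `points_with_topology` and `uniform_laplacian_energy` are hypothetical, and the simple averaging stands in for the paper's bundle adjustment fusion, whose exact formulation is not given here.

```python
import numpy as np


def points_with_topology(points, pixel_uv, template_uv):
    """Attach pixel-aligned 3D points to template vertices via UV (illustrative).

    points:      (P, 3) 3D point per pixel (e.g. from a point map)
    pixel_uv:    (P, 2) UV value per pixel
    template_uv: (V, 2) UV coordinate of each template vertex
    Returns (V, 3) averaged positions and a (V,) mask of observed vertices.
    """
    # Squared UV distance between every pixel and every template vertex.
    d2 = ((pixel_uv[:, None, :] - template_uv[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)  # index of the closest vertex per pixel

    num_vertices = len(template_uv)
    sums = np.zeros((num_vertices, 3))
    counts = np.zeros(num_vertices)
    np.add.at(sums, nearest, points)   # accumulate points per vertex
    np.add.at(counts, nearest, 1.0)    # count pixels per vertex

    observed = counts > 0
    vertices = np.zeros((num_vertices, 3))
    vertices[observed] = sums[observed] / counts[observed, None]
    return vertices, observed


def uniform_laplacian_energy(vertices, faces):
    """Sum of squared offsets between each vertex and the mean of its neighbors.

    Assumption: a uniform (unweighted) graph Laplacian over triangle edges.
    """
    n = len(vertices)
    adj = np.zeros((n, n))
    for a, b, c in faces:
        for i, j in ((a, b), (b, c), (c, a)):
            adj[i, j] = adj[j, i] = 1.0

    degree = adj.sum(axis=1)
    connected = degree > 0
    delta = np.zeros_like(vertices, dtype=float)
    neighbor_mean = (adj @ vertices)[connected] / degree[connected, None]
    delta[connected] = vertices[connected] - neighbor_mean
    return float((delta ** 2).sum())
```

In an optimization loop, a term like `uniform_laplacian_energy` would be added to the reprojection objective with a weight, so that vertices with few or noisy observations are pulled toward a locally smooth surface. Because the energy depends only on offsets from neighbor means, it is unchanged by a global translation of the mesh.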
