Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction

Noé Artru, Rukhshanda Hussain, Emeline Got, Alexandre Messier, David B. Lindell, Abdallah Dib

TL;DR

This work introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a feed-forward pass. These predictions then serve as strong geometric priors within an inverse rendering optimization framework to recover high-frequency surface details.

Abstract

Reconstructing high-fidelity 3D head geometry from images is critical for a wide range of applications, yet existing methods face fundamental limitations. Traditional photogrammetry achieves exceptional detail but requires extensive camera arrays (25-200+ views), substantial computation, and manual cleanup in challenging areas like facial hair. Recent alternatives present a fundamental trade-off: foundation models enable efficient single-image reconstruction but lack fine geometric detail, while optimization-based methods achieve higher fidelity but require dense views and expensive computation. We bridge this gap with a hybrid approach that combines the strengths of both paradigms. Our method introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a feed-forward pass. We then leverage these predictions as strong geometric priors within an inverse rendering optimization framework to recover high-frequency surface details. Our approach outperforms state-of-the-art single-image and multi-view methods, achieving high-fidelity reconstruction on par with dense-view photogrammetry while reducing camera requirements and computational cost. The code and model will be released.
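
To make the cross-view attention idea concrete, the following is a minimal NumPy sketch, not the authors' architecture. It assumes each view has already been encoded into a set of tokens by a monocular backbone, and it lets every token attend to the tokens of all views by flattening the view axis. The function name `cross_view_attention`, the single-head layout, and the random projection matrices are hypothetical choices made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(tokens, w_q, w_k, w_v):
    """Single-head attention over the tokens of all views jointly.

    tokens: (V, N, D) array, V views with N tokens of width D each.
    Returns the fused tokens (V, N, D_out) and the attention matrix (V*N, V*N).
    """
    V, N, D = tokens.shape
    flat = tokens.reshape(V * N, D)          # merge the view and token axes
    q, k, v = flat @ w_q, flat @ w_k, flat @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot-product scores
    attn = softmax(scores, axis=-1)          # each row sums to 1 across all views
    out = attn @ v
    return out.reshape(V, N, -1), attn

# Toy usage with made-up sizes: 3 views, 4 tokens per view, width 8.
rng = np.random.default_rng(0)
V, N, D = 3, 4, 8
tokens = rng.standard_normal((V, N, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
fused, attn = cross_view_attention(tokens, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Because the attention matrix spans all `V * N` tokens, the output for one view depends on the content of every other view, which is the property that allows per-view predictions to agree with each other. A monocular model would correspond to restricting attention to the `N` tokens of a single view.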

Paper Structure (24 sections, 8 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Given sparse multi-view images of a subject, our method predicts geometrically consistent surface normals and leverages them to recover complete and detailed 3D head geometry preserving fine surface features such as wrinkles and skin folds in a matter of seconds.
  • Figure 2: Skullptor reconstructs 3D meshes in two stages. Multi-view normal prediction (Sec. 3.1) produces geometrically consistent surface normals from sparse input images by leveraging cross-view attention across all viewpoints. Mesh optimization (Sec. 3.2) then refines the 3D geometry using the predicted normals as geometric priors within an inverse rendering framework.
  • Figure 3: Qualitative comparison of normal predictions across different methods. Top 2: NPHM; Bottom 2: Multiface
  • Figure 4: Qualitative comparison of mesh reconstruction methods. Top: NPHM; Bottom: Multiface
  • Figure 5: Evolution of mesh reconstruction performance with the number of input camera views (NPHM dataset).
  • ...and 3 more figures
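
The second stage described in the Figure 2 caption uses predicted normals as geometric priors during mesh optimization. The exact loss is not stated in this excerpt, so the sketch below assumes a common choice: a masked cosine distance between normals rendered from the current mesh and the predicted normal maps. The names `normal_prior_loss` and `normalize` are hypothetical, and a real pipeline would compute this in a differentiable framework rather than plain NumPy.

```python
import numpy as np

def normalize(n, eps=1e-8):
    # Scale each 3-vector to unit length, guarding against zero vectors.
    return n / np.maximum(np.linalg.norm(n, axis=-1, keepdims=True), eps)

def normal_prior_loss(rendered, predicted, mask):
    """Mean (1 - cosine similarity) over foreground pixels.

    rendered, predicted: (V, H, W, 3) normal maps for V views.
    mask: (V, H, W) boolean foreground mask.
    """
    cos = np.sum(normalize(rendered) * normalize(predicted), axis=-1)
    weights = mask.astype(np.float64)
    # Guard against an empty mask so the division is always defined.
    return float(np.sum((1.0 - cos) * weights) / max(weights.sum(), 1.0))

# Toy usage: 2 views of 4x4 pixels, all normals facing the camera (+z).
predicted = np.zeros((2, 4, 4, 3))
predicted[..., 2] = 1.0
mask = np.ones((2, 4, 4), dtype=bool)

aligned = normal_prior_loss(predicted.copy(), predicted, mask)
flipped = normal_prior_loss(-predicted, predicted, mask)
sideways = np.zeros_like(predicted)
sideways[..., 0] = 1.0
orthogonal = normal_prior_loss(sideways, predicted, mask)
print(aligned, orthogonal, flipped)
```

Under this assumed formulation the loss is zero when the rendered normals match the prediction and grows as they diverge, so minimizing it over the mesh vertices pulls the surface orientation toward the predicted normals in every view at once.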