Implicit Shape and Appearance Priors for Few-Shot Full Head Reconstruction

Pol Caselles, Eduard Ramon, Jaime Garcia, Gil Triginer, Francesc Moreno-Noguer

TL;DR

This work tackles few-shot full-head reconstruction by introducing a Surface Appearance Statistical Model (SA-SM) that encodes shape and appearance priors, and by modeling geometry as a deformation of a reference Signed Distance Function (SDF) within an implicit differentiable rendering framework. The proposed SIRA++ system combines a pre-trained SA-SM with a two-stage optimization (latent priors first, then deformation/renderer fine-tuning) to achieve accurate, detailed reconstructions from 1–3 images, while significantly reducing runtime via parallel ray tracing and caching. The authors expand the H3DS dataset to 60 high-resolution full-head scans for rigorous evaluation and demonstrate state-of-the-art geometry reconstruction, robustness to camera noise, and substantial speedups (roughly $80\%$ faster) over prior methods. This approach enables reliable, high-fidelity full-head avatars from minimal input, with broad impact for VR/AR, CG, and identity-preserving digital humans, and provides a valuable dataset resource for further research.

Abstract

Recent advances in learning techniques that employ coordinate-based neural representations have yielded remarkable results in multi-view 3D reconstruction tasks. However, these approaches often require a substantial number of input views (typically several tens) and computationally intensive optimization procedures to be effective. In this paper, we address these limitations for the problem of few-shot full 3D head reconstruction. We accomplish this by incorporating a probabilistic shape and appearance prior into coordinate-based representations, enabling faster convergence and improved generalization when working with only a few input images (as few as a single image). At test time, we leverage this prior to guide the fitting of a signed distance function using a differentiable renderer. By combining the statistical prior with parallelizable ray tracing and dynamic caching strategies, we achieve an efficient and accurate approach to few-shot full 3D head reconstruction. Moreover, we extend the H3DS dataset, which now comprises 60 high-resolution 3D full head scans with their corresponding posed images and masks, and use it for evaluation. On this dataset, we demonstrate state-of-the-art results in geometry reconstruction while being an order of magnitude faster than previous approaches.
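
The TL;DR describes geometry as a deformation of a reference Signed Distance Function (SDF). The following is a minimal sketch of that idea, not the authors' implementation. It assumes the deformed SDF is obtained by evaluating the reference SDF at displaced query points; the sphere reference shape, the constant-offset `deformation` function, and all function names are hypothetical stand-ins for the learned networks.

```python
import numpy as np

def reference_sdf(points, radius=1.0):
    """Stand-in reference shape: signed distance to a sphere at the origin."""
    return np.linalg.norm(points, axis=-1) - radius

def deformation(points, z_sdf):
    """Hypothetical latent-conditioned offset field.

    In the paper this role is played by a neural network; here the shape
    latent z_sdf is simply used as a constant 3D translation (assumption).
    """
    return np.broadcast_to(z_sdf, points.shape)

def deformed_sdf(points, z_sdf):
    """Evaluate the reference SDF at the displaced query points (assumption)."""
    return reference_sdf(points + deformation(points, z_sdf))

# Three query points along the x axis.
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [2.0, 0.0, 0.0]])
z_zero = np.zeros(3)                   # no deformation: plain reference SDF
z_shift = np.array([0.5, 0.0, 0.0])    # toy latent: shift queries by 0.5 in x

print(deformed_sdf(points, z_zero))
print(deformed_sdf(points, z_shift))
```

With a zero latent the sketch reduces to the reference shape, which mirrors the role of a frozen reference network: only the displacement field changes from subject to subject.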

Paper Structure (15 sections, 14 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Few-shot full-head reconstruction using SIRA++. Our approach enables high-fidelity 3D head reconstruction using only a few images. The figure showcases two examples, one obtained from a single input image (in 92 seconds) and the other from three input images (in 191 seconds). For each example, we present the input image(s) on the left and the corresponding reconstruction on the right, including the 3D mesh, rendered mesh, and normal maps. The results demonstrate the effectiveness and efficiency of our method in generating detailed 3D head avatars with minimal input data.
  • Figure 2: Overview of SIRA++. Left: We construct a surface appearance statistical model using a dataset of raw head scans paired with multiview posed images. This involves learning a codebook of shapes $\mathbf{z}_{\rm sdf}$ and appearances $\mathbf{z}_{\rm rend}$, alongside two decoders that approximate a signed distance function and a renderer. The prior is trained using an autodecoder approach. Right: The pre-trained prior model is integrated with the implicit differentiable renderer. To begin the optimization process with a plausible human head, we sample from the manifold of shape and appearance latents. During the initial iterations, our focus is on training the latents to approximate the closest human head within our statistical model. Subsequently, we unfreeze the deformation and rendering networks, enabling fine-tuning of the fine details. Throughout the entire optimization phase, the reference network remains frozen, ensuring consistent results.
  • Figure 3: Latent shape interpolation. Each row of the figure depicts a latent interpolation between different subjects, controlled by a weight $\alpha$. This interpolation process showcases the smooth and gradual transformation of shapes, reflecting the continuous variation in human head representations along the latent space.
  • Figure 4: Latent shape and appearance interpolation. This figure shows the joint shape and appearance latent interpolation between two subjects (top-left and bottom-right). Shape interpolation of the latent $\mathbf{z}_{\rm sdf}$ is controlled by the weight $\alpha$. Appearance interpolation of the latent $\mathbf{z}_{\rm rend}$ is controlled by the weight $\beta$.
  • Figure 5: H3DS Dataset. Three samples from the dataset, each scene composed of 60–100 RGB images, foreground masks, camera parameters, and high-resolution textured 3D meshes capturing the full head, including hair and upper body clothing.
  • ...and 5 more figures
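
The captions of Figures 3 and 4 describe latent interpolation controlled by a shape weight $\alpha$ and an appearance weight $\beta$. The sketch below illustrates one way such a grid of interpolated latents could be built. It assumes plain linear interpolation, which the captions do not state, and the latent dimensions, function names, and the assignment of $\alpha$ to rows and $\beta$ to columns are all hypothetical. Decoding the latents into geometry and color is left out.

```python
import numpy as np

def lerp(z_a, z_b, weight):
    """Linear interpolation between two latents (assumed blending rule)."""
    return (1.0 - weight) * z_a + weight * z_b

def interpolation_grid(z_sdf_a, z_sdf_b, z_rend_a, z_rend_b, steps=3):
    """Build a steps x steps grid of (shape latent, appearance latent) pairs.

    Rows vary the shape weight alpha, columns vary the appearance weight beta
    (this row/column layout is an assumption, not taken from the paper).
    """
    weights = np.linspace(0.0, 1.0, steps)
    return [
        [(lerp(z_sdf_a, z_sdf_b, alpha), lerp(z_rend_a, z_rend_b, beta))
         for beta in weights]
        for alpha in weights
    ]

# Toy latents for two subjects; the sizes are arbitrary illustration values.
z_sdf_a, z_sdf_b = np.zeros(4), np.ones(4)
z_rend_a, z_rend_b = np.zeros(2), np.full(2, 2.0)

grid = interpolation_grid(z_sdf_a, z_sdf_b, z_rend_a, z_rend_b, steps=3)
shape_mid, appearance_mid = grid[1][1]
print(shape_mid, appearance_mid)
```

The top-left cell reproduces the first subject and the bottom-right cell the second, matching the corner layout described for Figure 4. Every other cell mixes shape and appearance independently.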