InstantAvatar: Efficient 3D Head Reconstruction via Surface Rendering

Antonio Canela, Pol Caselles, Ibrar Malik, Eduard Ramon, Jaime García, Jordi Sánchez-Riera, Gil Triginer, Francesc Moreno-Noguer

TL;DR

InstantAvatar tackles the slow per-scene optimization of neural-field head reconstruction by introducing a grid-based SDF prior learned from thousands of head shapes and by leveraging differentiable surface rendering. A multi-resolution feature grid reduces decoder size and enables fast SDF queries, while a monocular normals cue stabilizes optimization and guides the capture of high-frequency detail. The approach achieves reconstruction accuracy competitive with state-of-the-art methods at about a 100× speed-up, recovering full-head avatars from a single image or a few images in a few seconds. This acceleration broadens the applicability of high-fidelity head reconstruction in AR/VR and related applications without sacrificing qualitative richness in hair, shoulders, and accessories.

Abstract

Recent advances in full-head reconstruction have been obtained by optimizing a neural field through differentiable surface or volume rendering to represent a single scene. While these techniques achieve unprecedented accuracy, they take several minutes, or even hours, due to the expensive optimization process required. In this work, we introduce InstantAvatar, a method that recovers full-head avatars from few images (down to just one) in a few seconds on commodity hardware. To speed up the reconstruction process, we propose a system that combines, for the first time, a voxel-grid neural field representation with a surface renderer. Notably, a naive combination of these two techniques leads to unstable optimizations that do not converge to valid solutions. To overcome this limitation, we present a novel statistical model that learns a prior distribution over 3D head signed distance functions using a voxel-grid-based architecture. The use of this prior model, in combination with other design choices, results in a system that achieves 3D head reconstructions with accuracy comparable to the state of the art at a 100x speed-up.
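
To make the role of the surface renderer concrete, the sketch below shows sphere tracing, a standard way to find the single point where a camera ray meets the zero level set of a signed distance function. The abstract does not say which intersection routine InstantAvatar uses, so this choice is an assumption; the analytic `sphere_sdf` stands in for the learned head SDF, and all function names and parameters are hypothetical.

```python
import numpy as np

def sphere_sdf(p, radius=1.0):
    """Analytic SDF of a sphere at the origin (stand-in for a learned head SDF)."""
    return np.linalg.norm(p) - radius

def sphere_trace(sdf, origin, direction, max_steps=64, eps=1e-5, t_max=10.0):
    """March along the ray by the SDF value until the surface is reached.

    Returns the intersection point, or None if the ray leaves the scene
    (t > t_max) or the step budget runs out.
    """
    direction = direction / np.linalg.norm(direction)
    t = 0.0
    for _ in range(max_steps):
        p = origin + t * direction
        d = sdf(p)
        if d < eps:        # close enough to the zero level set
            return p
        t += d             # the SDF value is a safe step size
        if t > t_max:
            break
    return None

hit = sphere_trace(sphere_sdf, np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
miss = sphere_trace(sphere_sdf, np.array([0.0, 2.0, -3.0]), np.array([0.0, 0.0, 1.0]))
```

Unlike volume rendering, which evaluates the field at many samples along each ray, a surface renderer of this kind shades only the one intersection point per ray, which is part of why fast SDF queries matter for overall reconstruction time.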

Paper Structure (15 sections, 4 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Reconstruction time comparison. InstantAvatar obtains full-head 3D avatars from one or a few images in a matter of seconds. This figure reports time versus reconstruction error for our method and state-of-the-art methods, considering only the face region, where all methods are applicable (full-head error metrics are reported in the experiments section). InstantAvatar's speed is surpassed only by 3DMM methods, which, however, are significantly less accurate. Compared against other neural field approaches, InstantAvatar obtains a 100$\times$ speed-up at similar reconstruction error values.
  • Figure 2: Overview of our method. For each query point $\mathbf{x}$ we obtain the feature $\mathbf{z}(\mathbf{x})$ from the multi-resolution feature grid at different levels of detail. We then concatenate the positional encoding of $\mathbf{x}$, the global feature $\mathbf{g}_{i}$, and the grid feature $\mathbf{z}(\mathbf{x})$ to query the SDF, which is parameterized by a shallow MLP. We supervise the gradient of the SDF with the predicted normal at the pixel location where the ray intersects the surface. Finally, we use a rendering network to predict the radiance emitted from the surface point $\mathbf{x}_{diff}$, with normal $\mathbf{n}$, in a viewing direction $\mathbf{v}$.
  • Figure 3: Ablation: Qualitative comparison. We conduct an ablation study to qualitatively compare variations of our model using an H3DS dataset scene in the multi-view setting (6 views). The bottom row zooms into the face region to better show the differences among configurations. Both our final approach and the one without normals supervision outperform the remaining alternatives. However, without normals supervision the resulting shape tends to be excessively sharp (e.g. the outermost part of the eyebrows) or erroneous (hair). The single-grid and 8-layer MLP (without grid) results are comparable, although neither captures the high-frequency details obtained with our final model.
  • Figure 4: Ablation: Convergence speed. Convergence speed across epochs, averaged over all cases of the H3DS dataset in the multi-view setting (6 views). Grid-based methods converge faster than MLP approaches. Normals supervision helps convergence during fine-tuning at the reconstruction stage.
  • Figure 5: Qualitative results. We qualitatively compare InstantAvatar with other state-of-the-art methods on the H3DS dataset for 1 view, 3 views and 6 views. 1 view: SIRA better captures the identity of the subject; however, it takes 10 min of training. DECA, on the other hand, can only predict the face region. 3 views: H3D-Net achieves good bias but at the cost of high variance, with clearly visible artifacts on the chin and the hair. 6 views: H3D-Net recovers the hair and face regions with quality similar to InstantAvatar.
  • ...and 1 more figure
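
The query path described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the grid resolutions, channel counts, feature sizes, softplus activation, random weights, and the cosine-style `normal_loss` are all hypothetical, and the SDF gradient is approximated with central finite differences rather than automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)

def trilinear(grid, x):
    """Trilinearly interpolate a (R, R, R, C) feature grid at x in [0, 1]^3."""
    res = grid.shape[0]
    p = np.clip(x, 0.0, 1.0) * (res - 1)       # continuous voxel coordinates
    i0 = np.minimum(p.astype(int), res - 2)    # lower corner index per axis
    t = p - i0                                 # fractional offset inside the cell
    out = np.zeros(grid.shape[-1])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((t[0] if dx else 1 - t[0])
                     * (t[1] if dy else 1 - t[1])
                     * (t[2] if dz else 1 - t[2]))
                out += w * grid[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return out

def grid_feature(grids, x):
    """z(x): concatenate interpolated features from every resolution level."""
    return np.concatenate([trilinear(g, x) for g in grids])

def positional_encoding(x, n_freqs=4):
    """Sin/cos encoding of x at n_freqs octaves, concatenated with x itself."""
    freqs = 2.0 ** np.arange(n_freqs) * np.pi
    angles = x[:, None] * freqs[None, :]
    return np.concatenate([x, np.sin(angles).ravel(), np.cos(angles).ravel()])

def sdf(x, g, grids, params):
    """Shallow MLP on [PE(x), g, z(x)] returning a scalar signed distance."""
    w1, b1, w2, b2 = params
    h = np.concatenate([positional_encoding(x), g, grid_feature(grids, x)])
    h = np.logaddexp(0.0, w1 @ h + b1)         # softplus activation
    return float(w2 @ h + b2)

def sdf_normal(x, g, grids, params, h=1e-4):
    """Unit normal from the SDF gradient, via central finite differences."""
    grad = np.array([
        (sdf(x + h * e, g, grids, params) - sdf(x - h * e, g, grids, params)) / (2 * h)
        for e in np.eye(3)
    ])
    return grad / (np.linalg.norm(grad) + 1e-12)

# Hypothetical sizes: three grid levels, 2 channels each, 8-dim global feature.
levels, channels, g_dim, hidden = (4, 8, 16), 2, 8, 32
grids = [rng.normal(scale=0.1, size=(r, r, r, channels)) for r in levels]
g = rng.normal(size=g_dim)                     # global feature g_i
in_dim = 3 + 3 * 2 * 4 + g_dim + channels * len(levels)
params = (rng.normal(scale=0.1, size=(hidden, in_dim)), np.zeros(hidden),
          rng.normal(scale=0.1, size=hidden), 0.0)

x = np.array([0.41, 0.52, 0.63])               # query point in the unit cube
n = sdf_normal(x, g, grids, params)
n_pred = np.array([0.0, 0.0, 1.0])             # placeholder monocular normal
normal_loss = 1.0 - float(n @ n_pred)          # hypothetical cosine penalty
```

Because most of the capacity lives in the interpolated grid features, the MLP that consumes them can stay small, which matches the caption's description of a shallow decoder. The weights here are random, so the resulting SDF is not a meaningful shape; the sketch only shows how the inputs are assembled and how a normal is read off the SDF gradient.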