InstantGeoAvatar: Effective Geometry and Appearance Modeling of Animatable Avatars from Monocular Video

Alvaro Budria, Adrian Lopez-Rodriguez, Oscar Lorente, Francesc Moreno-Noguer

TL;DR

InstantGeoAvatar tackles the problem of reconstructing and animating detailed 3D clothed human avatars from monocular RGB video at interactive speeds. It introduces a canonical signed distance field $f_{sdf}$ and a texture field $f_{rgb}$, both parameterized on a multiresolution hash grid and regularized by a geometry-aware surface term $\mathcal{L}_{smooth}$ that is integrated into differentiable volume rendering to stabilize hash-grid optimization. Training combines a photometric loss, a mask loss, an Eikonal loss, and the proposed smoothing loss, yielding fast and robust optimization that delivers competitive geometry and novel-view synthesis in as little as $5$–$10$ minutes. The approach enables interactive reconstruction of virtual avatars with improved surface coherence, watertight meshes, and efficient rendering suitable for AR/VR workflows.
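To make the loss combination above concrete, here is a minimal NumPy sketch of the Eikonal term and a weighted sum of the four losses. It is an illustration under stated assumptions, not the authors' implementation: the sphere SDF stands in for $f_{sdf}$, the function names and the weights `w_mask`, `w_eik`, `w_smooth` are hypothetical placeholders, and the form of $\mathcal{L}_{smooth}$ is not given in this summary, so it appears only as an opaque scalar.

```python
import numpy as np

def sphere_sdf_grad(points):
    """Analytic gradient of a unit-sphere SDF (a stand-in for the learned f_sdf)."""
    return points / np.linalg.norm(points, axis=-1, keepdims=True)

def eikonal_loss(grads):
    """Mean squared deviation of the SDF gradient norm from 1."""
    return float(np.mean((np.linalg.norm(grads, axis=-1) - 1.0) ** 2))

def total_loss(l_rgb, l_mask, l_eik, l_smooth,
               w_mask=0.1, w_eik=0.1, w_smooth=0.1):
    """Weighted sum of the four terms; the weights are placeholders, not the paper's values."""
    return l_rgb + w_mask * l_mask + w_eik * l_eik + w_smooth * l_smooth

rng = np.random.default_rng(0)
pts = rng.normal(size=(1024, 3))

# A true SDF has unit-norm gradients, so its Eikonal loss vanishes.
exact = eikonal_loss(sphere_sdf_grad(pts))
# Doubling the gradient breaks the unit-norm property and is penalized.
scaled = eikonal_loss(2.0 * sphere_sdf_grad(pts))
```

The Eikonal term only constrains the gradient norm; it says nothing about how neighboring hash-grid cells relate to each other, which is the gap the proposed smoothing term is meant to address.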

Abstract

We present InstantGeoAvatar, a method for efficient and effective learning from monocular video of detailed 3D geometry and appearance of animatable implicit human avatars. Our key observation is that the optimization of a hash grid encoding to represent a signed distance function (SDF) of the human subject is fraught with instabilities and bad local minima. We thus propose a principled geometry-aware SDF regularization scheme that seamlessly fits into the volume rendering pipeline and adds negligible computational overhead. Our regularization scheme significantly outperforms previous approaches for training SDFs on hash grids. We obtain competitive results in geometry reconstruction and novel view synthesis in as little as five minutes of training time, a significant reduction from the several hours required by previous work. InstantGeoAvatar represents a significant leap forward towards achieving interactive reconstruction of virtual avatars.

Paper Structure

This paper contains 16 sections, 10 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: InstantGeoAvatar. We introduce a system capable of reconstructing the geometry and appearance of animatable human avatars from monocular video in less than 10 minutes. In order to attain high quality geometry reconstructions, we propose a smoothing term directly on the learned signed distance field during optimization, requiring no extra computation or sampling and delivering noticeable qualitative improvements.
  • Figure 2: Non-local updates of the hash grid features. We consider a 1D hash grid encoding segment to illustrate how the proposed regularization affects backpropagation updates. (a) The vanilla Eikonal loss performs backpropagation updates on a single local hash grid cell, resulting in discontinuous and spatially disconnected updates. (b) Neuralangelo uses numerical gradients to distribute backpropagation updates to other cells in the grid, resulting in more spatially coherent learned features. (c) Our proposed smooth surface regularization also distributes backpropagation updates.
  • Figure 3: Neuralangelo's SDF training scheme under a longer training regime. Our approach beats Neuralangelo's proposal even after 24 hours of training.
  • Figure 4: Multiscale effect of the proposed loss term.
  • Figure 5: Qualitative comparison of SDF regularization schemes. From left to right, top to bottom: ours, hybrid positional encoding (SHINOBI), curvature loss and finite-difference derivatives (Neuralangelo), and varying weights for the Eikonal loss (IGR).
  • ...and 2 more figures
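
The locality argument in the Figure 2 caption can be illustrated with a toy 1D grid. The sketch below is an assumption-laden simplification: it uses a dense, linearly interpolated feature array instead of a hash grid, and the helper names are hypothetical. It shows only which grid entries a derivative depends on, and therefore which entries receive backpropagation updates.

```python
import numpy as np

def interp(features, x):
    """Linearly interpolate a 1D feature grid with unit spacing at position x."""
    i = int(np.floor(x))
    t = x - i
    return (1.0 - t) * features[i] + t * features[i + 1]

def analytic_support(x):
    """Grid indices the analytic derivative at x depends on: one cell's two corners."""
    i = int(np.floor(x))
    return {i, i + 1}

def numeric_support(x, eps):
    """Grid indices the central difference at x depends on, for step size eps."""
    lo, hi = int(np.floor(x - eps)), int(np.floor(x + eps))
    return {lo, lo + 1, hi, hi + 1}

def numeric_grad(features, x, eps):
    """Central finite-difference derivative of the interpolated field."""
    return (interp(features, x + eps) - interp(features, x - eps)) / (2.0 * eps)

features = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])
x = 2.5

local = analytic_support(x)          # only the cell containing x
spread = numeric_support(x, 1.0)     # step of one cell reaches both neighboring cells
slope = numeric_grad(features, x, 1.0)
```

With a step smaller than the distance to the cell boundary, `numeric_support` collapses back to the single cell, so the step size controls how far updates spread.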