HAVE-FUN: Human Avatar Reconstruction from Few-Shot Unconstrained Images

Xihe Yang, Xingyu Chen, Daiheng Gao, Shaohui Wang, Xiaoguang Han, Baoyuan Wang

TL;DR

HaveFun tackles animatable human avatar reconstruction from few-shot unconstrained images. It introduces a drivable tetrahedral representation that builds on Deep Marching Tetrahedra (DMTet) and adds SMPL-X skinning to handle articulated motion. Optimization has two phases: few-shot reference, which constrains body and hand geometry and texture using masks, normals, and depths, and few-shot guidance, which applies a diffusion prior through Score Distillation Sampling (SDS) with Zero123 to synthesize plausible textures for unseen regions, so that renderings stay coherent across views and poses. The representation, consisting of a mesh $M$, texture $C$, skinning weights $W$, and expression blendshapes $E$, is initialized from a canonical SMPL-X prior and refined with the reconstruction loss $\mathcal{L}_{recon}=\mathcal{L}_{texture}+\mathcal{L}_{normal}+\mathcal{L}_{depth}+\mathcal{L}_{mask}$ for bodies, a lighter version of it for hands, and a Laplacian regularizer. The framework is evaluated on two new benchmarks, FS-XHumans and FS-DART, where it shows superior rendering quality and robust animation from as few as 2 shots, and it also applies to real-world casual captures. The result is a practical, low-cost route to avatars with explicit 3D structure and controllable animation from sparse data.
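To make the body reconstruction loss concrete, the sketch below sums four per-view terms exactly as in the formula above. Only the unweighted sum comes from the summary; the per-term distances (masked L1 for texture, normal, and depth, mean squared error for the mask), the dictionary keys, and the function names are assumptions chosen for illustration and may differ from the paper's definitions.

```python
import numpy as np

def masked_l1(pred, target, mask):
    """Mean absolute error restricted to pixels where mask == 1 (assumed distance)."""
    mask = mask.astype(pred.dtype)
    if pred.ndim == 3:               # (H, W, C) maps such as RGB or normals
        weights = mask[..., None]
        channels = pred.shape[-1]
    else:                            # (H, W) maps such as depth
        weights = mask
        channels = 1
    denom = max(weights.sum() * channels, 1.0)  # avoid division by zero on empty masks
    return float((np.abs(pred - target) * weights).sum() / denom)

def recon_loss(pred, ref):
    """L_recon = L_texture + L_normal + L_depth + L_mask for one reference view.

    `pred` and `ref` are dicts with hypothetical keys "rgb", "normal",
    "depth", and "mask" holding rendered and reference maps.
    """
    m = ref["mask"]
    terms = {
        "texture": masked_l1(pred["rgb"], ref["rgb"], m),
        "normal": masked_l1(pred["normal"], ref["normal"], m),
        "depth": masked_l1(pred["depth"], ref["depth"], m),
        "mask": float(np.mean((pred["mask"] - ref["mask"]) ** 2)),
    }
    terms["recon"] = sum(terms.values())  # unweighted sum, as in the formula
    return terms
```

Returning the individual terms alongside the total makes it easy to drop terms for the lighter hand variant, whose exact composition this summary does not specify.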

Abstract

As for human avatar reconstruction, contemporary techniques commonly necessitate the acquisition of costly data and struggle to achieve satisfactory results from a small number of casual images. In this paper, we investigate this task from a few-shot unconstrained photo album. The reconstruction of human avatars from such data sources is challenging because of limited data amount and dynamic articulated poses. For handling dynamic data, we integrate a skinning mechanism with deep marching tetrahedra (DMTet) to form a drivable tetrahedral representation, which drives arbitrary mesh topologies generated by the DMTet for the adaptation of unconstrained images. To effectively mine instructive information from few-shot data, we devise a two-phase optimization method with few-shot reference and few-shot guidance. The former focuses on aligning avatar identity with reference images, while the latter aims to generate plausible appearances for unseen regions. Overall, our framework, called HaveFun, can undertake avatar reconstruction, rendering, and animation. Extensive experiments on our developed benchmarks demonstrate that HaveFun exhibits substantially superior performance in reconstructing the human body and hand. Project website: https://seanchenxy.github.io/HaveFunWeb/.
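As an illustration of what a skinning mechanism does to an extracted mesh, the following sketch applies standard linear blend skinning, the scheme used by SMPL-family body models. It is a minimal example under assumptions: the function name and array layouts are hypothetical, the skinning weights are taken as given, and how HaveFun assigns weights to DMTet vertices is not described in this section.

```python
import numpy as np

def linear_blend_skinning(vertices, weights, transforms):
    """Pose canonical vertices with per-vertex blends of bone transforms.

    vertices:   (V, 3) canonical-space positions
    weights:    (V, J) skinning weights, each row summing to 1
    transforms: (J, 4, 4) rigid bone transforms (canonical -> posed)
    """
    # Blend the bone matrices per vertex: (V, J) x (J, 4, 4) -> (V, 4, 4)
    blended = np.einsum("vj,jab->vab", weights, transforms)
    # Homogeneous coordinates so the matrix product also applies translation
    homo = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)
    posed = np.einsum("vab,vb->va", blended, homo)
    return posed[:, :3]
```

Because the function only needs per-vertex weights, it works for any vertex count or connectivity, which is what allows a mesh with arbitrary topology to be driven by a skeleton.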

Paper Structure (50 sections, 13 equations, 16 figures, 4 tables)


Figures (16)

  • Figure 1: Overview of the HaveFun framework. Building on DMTet, we design a drivable tetrahedral representation with a skinning mechanism. For optimization, we employ loss functions based on reference-data reconstruction and SDS guidance to create human avatars from few-shot unconstrained images.
  • Figure 2: Training data for ablation studies in Figs. 3 and 4. Blue and green boxes indicate 2- and 4-shot training data, respectively.
  • Figure 3: Ablation studies on the few-shot body reconstruction task. Zoom in to see details.
  • Figure 4: Ablation studies on the few-shot hand reconstruction task. Zoom in to see details.
  • Figure 5: Comparison of hand reconstruction on FS-DART. See suppl. material for training data.
  • ...and 11 more figures