NoPo-Avatar: Generalizable and Animatable Avatars from Sparse Inputs without Human Poses

Jing Wen, Alexander G. Schwing, Shenlong Wang

TL;DR

NoPo-Avatar addresses animatable avatar reconstruction from sparse inputs without relying on camera or human poses at test time. It introduces a dual-branch architecture with a template branch (SMPL-X T-pose) and image branches, producing Gaussian splats that are merged into a canonical representation, which is then articulated via linear blend skinning (LBS) and rendered with Gaussian splatting to novel views and poses. The method is trained end-to-end with a composite loss including $L_{\text{mse}}$, $L_{\text{lpips}}$, $L_{\text{chamfer}}$, $L_{\text{proj}}$, and $L_{\text{lbs}}$, enabling accurate detail capture and inpainting of unseen regions. Empirically, it outperforms baselines that rely on pose priors when ground-truth poses are unavailable at test time, and remains competitive with pose-informed methods in lab settings across THuman2.0, XHuman, and HuGe100K, while offering fast reconstruction relative to per-scene optimization. The work broadens the practical applicability of animatable avatars by removing the dependence on pose estimation, at the cost of some limitations in hands and faces and potential multi-view consistency concerns on synthetic data.
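
As a rough illustration of the articulation step mentioned above, the sketch below applies linear blend skinning (LBS) to canonical Gaussian centers only. It is not the authors' implementation: the function names (`apply_transform`, `lbs_centers`), the (rotation, translation) layout of a bone transform, and the weight renormalization are assumptions made for this example, and a full pipeline would also need to transform each Gaussian's covariance.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
# Hypothetical layout for one rigid bone transform: (3x3 rotation rows, translation).
Transform = Tuple[Sequence[Sequence[float]], Sequence[float]]


def apply_transform(transform: Transform, point: Sequence[float]) -> Vec3:
    """Return R @ point + t for a single bone transform."""
    rotation, translation = transform
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


def lbs_centers(
    centers: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    transforms: Sequence[Transform],
) -> List[Vec3]:
    """Pose canonical centers as a weighted blend of per-bone transformed points."""
    posed = []
    for center, point_weights in zip(centers, weights):
        if len(point_weights) != len(transforms):
            raise ValueError("need exactly one skinning weight per bone")
        total = sum(point_weights)
        if total <= 0:
            raise ValueError("skinning weights must have a positive sum")
        blended = [0.0, 0.0, 0.0]
        for w, transform in zip(point_weights, transforms):
            moved = apply_transform(transform, center)
            for axis in range(3):
                # Renormalize so the weights of each point sum to one (an assumption).
                blended[axis] += (w / total) * moved[axis]
        posed.append(tuple(blended))
    return posed
```

For instance, a center at (1, 1, 1) weighted equally between an identity bone and a bone translated by 2 along x moves to (2, 1, 1), halfway between the two rigid results.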

Abstract

We tackle the task of recovering an animatable 3D human avatar from a single or a sparse set of images. For this task, beyond a set of images, many prior state-of-the-art methods use accurate "ground-truth" camera poses and human poses as input to guide reconstruction at test-time. We show that pose-dependent reconstruction degrades results significantly if pose estimates are noisy. To overcome this, we introduce NoPo-Avatar, which reconstructs avatars solely from images, without any pose input. By removing the dependence of test-time reconstruction on human poses, NoPo-Avatar is not affected by noisy human pose estimates, making it more widely applicable. Experiments on challenging THuman2.0, XHuman, and HuGe100K data show that NoPo-Avatar outperforms existing baselines in practical settings (without ground-truth poses) and delivers comparable results in lab settings (with ground-truth poses).
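
The sensitivity study described in the paper's Figure 1 caption (Gaussian noise of several standard deviations injected into input poses, averaged over 5 runs) can be sketched generically as follows. The `evaluate` callback, the flat list of pose parameters, and the function names are hypothetical stand-ins; the paper's actual pose parameterization and quality metrics are not reproduced here.

```python
import random
from statistics import mean, stdev
from typing import Callable, Dict, List, Sequence, Tuple


def perturb_pose(pose: Sequence[float], sigma: float, rng: random.Random) -> List[float]:
    """Add independent zero-mean Gaussian noise to every pose parameter."""
    return [p + rng.gauss(0.0, sigma) for p in pose]


def sensitivity_sweep(
    pose: Sequence[float],
    sigmas: Sequence[float],
    evaluate: Callable[[Sequence[float]], float],
    runs: int = 5,
    seed: int = 0,
) -> Dict[float, Tuple[float, float]]:
    """Map each noise level to (mean score, sample std of the score) over `runs` trials."""
    if runs < 2:
        raise ValueError("need at least two runs to compute a standard deviation")
    rng = random.Random(seed)  # fixed seed keeps the sweep reproducible
    results = {}
    for sigma in sigmas:
        scores = [evaluate(perturb_pose(pose, sigma, rng)) for _ in range(runs)]
        results[sigma] = (mean(scores), stdev(scores))
    return results
```

In practice `evaluate` would wrap a pose-dependent reconstruction model and return a rendering metric; a pose-free method has no such input to perturb, which is the point of the comparison.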

Paper Structure

This paper contains 27 sections, 5 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: (a) Sensitivity to input pose noise. Previous methods (kwon2024ghg, wen2025lifegom) take camera poses and human poses as inputs. We measure their sensitivity to input poses by injecting Gaussian noise of different standard deviations or by using a predicted pose. (Averaged over 5 runs for Gaussian noise; standard deviations are multiplied by 3 for better visualization.) (b) Comparison of rendering quality. With inaccurate predicted input poses, LIFe-GoM cannot produce high-fidelity renderings. In contrast, our method, which does not take any poses as input, produces high-quality renderings.
  • Figure 2: Model architecture of the reconstruction module. The reconstruction module reconstructs the canonical T-pose representation solely from images. It follows an encoder-decoder structure and consists of two types of branches: a template branch and image branches. We show two views of each branch's predictions: the splatter images in 2D format and their visualizations in 3D. Gaussians predicted by all branches are combined and fed into articulation and rendering.
  • Figure 3: Novel view synthesis from sparse input images on THuman2.0. Our approach performs on par with the state of the art in the lab setting (ground-truth input poses during test-time reconstruction) and sometimes captures sharper details. In the real setting (predicted input poses during test-time reconstruction), the rendering quality of GHG and LIFe-GoM degrades substantially, whereas our approach, which uses no pose priors, is unaffected by inaccurate poses.
  • Figure 4: Comparisons on novel view synthesis from a single image on HuGe100K. Our model captures details better than IDOL and LHM. It can also reconstruct challenging clothing, such as long dresses.
  • Figure 5: Ablation studies. Left: Ablations on the template branch and image branches. With a single image as input, the template branch alone cannot model fine details, such as the prints on the T-shirts (orange boxes). The image branches alone miss unseen regions (green boxes). Using both branches offers the best overall quality. Middle: Ablation on $L_\text{proj}$. Without $L_\text{proj}$, only the template Gaussians are effective in the rendering, leading to blurry results. Right: Ablation on $L_\text{lbs}$. Without supervision from the pseudo LBS weights, the image branch fails to reconstruct in the canonical T-pose and to predict the correct LBS weights.
  • ...and 6 more figures