IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing

Shaofei Wang, Božidar Antić, Andreas Geiger, Siyu Tang

TL;DR

IntrinsicAvatar tackles physically based inverse rendering of dynamic clothed humans from monocular video by disentangling geometry, albedo, material, and lighting. It couples articulated neural radiance fields in a canonical space with volumetric scattering and explicit Monte-Carlo ray tracing, enabling accurate secondary shading effects such as shadows and indirect illumination while still generalizing to novel poses. A novel articulated secondary ray tracing scheme and a BRDF-augmented volume scattering model provide realistic relighting under unseen poses and lighting without ground-truth material supervision. The approach reports quantitative and qualitative gains over state-of-the-art baselines on synthetic and real datasets, highlighting its potential for editable, relightable dynamic human avatars in practical AR/VR pipelines.

Abstract

We present IntrinsicAvatar, a novel approach to recovering the intrinsic properties of clothed human avatars including geometry, albedo, material, and environment lighting from only monocular videos. Recent advancements in human-based neural rendering have enabled high-quality geometry and appearance reconstruction of clothed humans from just monocular videos. However, these methods bake intrinsic properties such as albedo, material, and environment lighting into a single entangled neural representation. On the other hand, only a handful of works tackle the problem of estimating geometry and disentangled appearance properties of clothed humans from monocular videos. They usually achieve limited quality and disentanglement due to approximations of secondary shading effects via learned MLPs. In this work, we propose to model secondary shading effects explicitly via Monte-Carlo ray tracing. We model the rendering process of clothed humans as a volumetric scattering process, and combine ray tracing with body articulation. Our approach can recover high-quality geometry, albedo, material, and lighting properties of clothed humans from a single monocular video, without requiring supervised pre-training using ground truth materials. Furthermore, since we explicitly model the volumetric scattering process and ray tracing, our model naturally generalizes to novel poses, enabling animation of the reconstructed avatar in novel lighting conditions.
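To make the phrase "secondary shading effects explicitly via Monte-Carlo ray tracing" concrete, the following is a minimal sketch of a Monte-Carlo estimate of diffuse shading with a visibility term. It is not the paper's implementation: the Lambertian BRDF, cosine-weighted sampling, and the names `sample_cosine_hemisphere`, `shade_diffuse`, `env_radiance`, and `visibility` are all assumptions chosen for illustration. In the paper, visibility would come from tracing secondary rays through the articulated volume; here it is a hypothetical callback.

```python
import math
import random


def sample_cosine_hemisphere(rng):
    """Draw a direction around the +z normal with pdf = cos(theta) / pi."""
    u1, u2 = rng.random(), rng.random()
    r = math.sqrt(u1)
    phi = 2.0 * math.pi * u2
    x, y = r * math.cos(phi), r * math.sin(phi)
    z = math.sqrt(max(0.0, 1.0 - u1))
    return (x, y, z)


def shade_diffuse(albedo, env_radiance, visibility, num_samples=4096, seed=0):
    """Monte-Carlo estimate of outgoing radiance at a point with normal +z.

    env_radiance(w) and visibility(w) are hypothetical callbacks returning the
    incoming radiance and the 0..1 visibility along direction w.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        w = sample_cosine_hemisphere(rng)
        cos_theta = w[2]
        pdf = cos_theta / math.pi
        if pdf <= 0.0:
            continue  # grazing sample contributes nothing
        brdf = albedo / math.pi  # Lambertian BRDF (an assumption)
        total += brdf * env_radiance(w) * visibility(w) * cos_theta / pdf
    return total / num_samples


# Constant white lighting, no occlusion: the estimate equals the albedo.
unoccluded = shade_diffuse(0.8, lambda w: 1.0, lambda w: 1.0)
# Block every direction with x <= 0: roughly half the light is lost.
half_shadowed = shade_diffuse(0.8, lambda w: 1.0, lambda w: 1.0 if w[0] > 0 else 0.0)
```

Because the visibility term is evaluated per sampled direction rather than predicted by a learned network, shadows in this kind of estimator change automatically when the occluding geometry moves, which is the intuition behind the generalization claim in the abstract.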

Paper Structure (28 sections, 39 equations, 12 figures, 3 tables, 1 algorithm)

Figures (12)

  • Figure 1: IntrinsicAvatar aims to achieve physically based inverse rendering of clothed humans from monocular videos. Left: Our model takes a monocular video as input and learns an avatar of the target person. Middle: We show decomposed properties of the learned avatar. Importantly, our model can produce such decomposition without any data-driven prior on geometry, albedo, or material. Right: With the learned avatar and intrinsic properties, we can animate and relight the avatar using arbitrary pose and arbitrary lighting condition.
  • Figure 2: Inverse Rendering of Clothed Avatars with Volumetric Scattering. Given an input image and associated camera rays, we warp the rays to the canonical space and perform both primary and secondary ray marching/tracing in canonical space. We model geometry with a geometry hash grid $\gamma_g$ and MLP $f_g$, while also modeling volumetric radiance and material with an appearance grid $\gamma_c$ and two additional MLPs $f_{rf}$, $f_m$. We supervise both $C_{rf}$ and $C_{pbr}$ using an L1 loss with respect to the input image.
  • Figure 3: Illustration of Volumetric Scattering. Volumetric scattering can blend between multiple surfaces when a ray crosses edges (left). This results in smooth transitions of appearance at boundaries, avoiding noisy shadow (middle) and lighting (right) at these locations.
  • Figure 4: Qualitative comparison to the baseline. We show the results of our method and R4D* on both synthetic (left) and real (middle, right) datasets. As indicated, R4D* struggles to recover intrinsic properties of avatars and does not produce realistic relighting results. Furthermore, it fails to generalize to novel poses. Our method produces high-quality results on both synthetic and real datasets, while generalizing well to novel poses and illuminations. More qualitative results can be found in Appendix \ref{appx:additional_qualitative_results}.
  • Figure 5: Ablation study. We visualize average visibility (AV) maps of each variant and report albedo PSNR ($\uparrow$)/albedo SSIM ($\uparrow$)/Normal Error ($\downarrow$). Surface scattering with rendered depth results in discontinuities at boundaries and edges. Without our proposed iso-surface search for secondary ray tracing, the visibility map is much darker and does not reflect true visibility. We also refer readers to Fig. \ref{fig:importance_sampling} for qualitative relighting results.
  • ...and 7 more figures
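
The supervision described in the Figure 2 caption (two rendered colors, $C_{rf}$ and $C_{pbr}$, each compared to the input pixel with an L1 loss) can be sketched as follows. This is a rough illustration under stated assumptions: it uses the standard density-based volume rendering quadrature, which may differ from the paper's exact formulation, and the function names `volume_render_weights`, `composite`, and `l1_loss` as well as the sample values are hypothetical.

```python
import math


def volume_render_weights(densities, deltas):
    """Per-sample compositing weights along one ray (standard quadrature)."""
    weights = []
    transmittance = 1.0
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha  # light surviving past this segment
    return weights


def composite(weights, colors):
    """Blend per-sample RGB colors into one pixel color."""
    return [sum(w * c[k] for w, c in zip(weights, colors)) for k in range(3)]


def l1_loss(pred, target):
    """Mean absolute error over the RGB channels."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


# Two samples on a ray: empty space, then a nearly opaque surface.
weights = volume_render_weights([0.0, 1e6], [0.1, 0.1])
radiance_colors = [(1.0, 0.0, 0.0), (0.2, 0.4, 0.6)]  # stand-in for f_rf outputs
pbr_colors = [(0.0, 1.0, 0.0), (0.25, 0.4, 0.6)]      # stand-in for shaded PBR colors
pixel = (0.2, 0.4, 0.6)

c_rf = composite(weights, radiance_colors)
c_pbr = composite(weights, pbr_colors)
total_loss = l1_loss(c_rf, pixel) + l1_loss(c_pbr, pixel)
```

Both colors share the same compositing weights in this sketch, so the two loss terms constrain the same underlying geometry while differing only in how per-sample color is produced.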