Dynamic Avatar-Scene Rendering from Human-centric Context

Wenqing Wang, Haosen Yang, Josef Kittler, Xiatian Zhu

TL;DR

This work tackles dynamic human-centric 4D reconstruction from monocular video by addressing the boundary artifacts that arise when scene and avatar are modeled separately. It introduces Separate-then-Map (StM), which uses a shared per-attribute transformation to map background Gaussians $\mathcal{G}_b$ and avatar Gaussians $\mathcal{G}_a$ into a unified space, enabling coherent rendering with a 3DGS background and an SMPL-guided deformable avatar. The approach designs a mapping for each Gaussian attribute, trains end to end with depth-aware and image-based losses, and reports state-of-the-art results on NeuMan, especially at human-scene boundaries, with only a small per-iteration overhead. The resulting photorealistic human-centric renderings are suited to AR/VR, VFX, and digital humans interacting with their environments.

Abstract

Reconstructing dynamic humans interacting with real-world environments from monocular videos is an important and challenging task. Despite considerable progress in 4D neural rendering, existing approaches either model dynamic scenes holistically or model humans and backgrounds separately in order to introduce parametric human priors. However, these approaches either neglect the distinct motion characteristics of the scene's components, especially the human, leading to incomplete reconstructions, or ignore the information exchange between the separately modeled components, resulting in spatial inconsistencies and visual artifacts at human-scene boundaries. To address this, we propose the Separate-then-Map (StM) strategy, which introduces a dedicated information mapping mechanism to bridge separately defined and optimized models. Our method employs a shared transformation function for each Gaussian attribute to unify the separately modeled components, improving computational efficiency by avoiding exhaustive pairwise interactions while ensuring spatial and visual coherence between humans and their surroundings. Extensive experiments on monocular video datasets demonstrate that StM significantly outperforms existing state-of-the-art methods in both visual quality and rendering accuracy, particularly at challenging human-scene interaction boundaries.
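To make the efficiency argument concrete, the following is a minimal sketch of what "one shared transformation per Gaussian attribute" could look like. It is not the paper's implementation: the `Gaussian` class, `map_to_unified_space`, and the placeholder mappings in `shared_maps` are all hypothetical names and functions chosen for illustration, and the actual mapping functions in StM are presumably learned rather than fixed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Gaussian:
    """Toy Gaussian with two attributes (hypothetical, for illustration)."""
    position: List[float]  # 3D mean
    opacity: float


def map_to_unified_space(
    gaussians: List[Gaussian], shared_maps: Dict[str, Callable]
) -> List[Gaussian]:
    """Apply the same per-attribute mapping to every Gaussian in a set."""
    return [
        Gaussian(
            position=shared_maps["position"](g.position),
            opacity=shared_maps["opacity"](g.opacity),
        )
        for g in gaussians
    ]


# Placeholder mappings (assumptions, not the paper's functions):
# one callable per attribute, shared by the background and avatar models.
shared_maps = {
    "position": lambda p: [2.0 * v for v in p],
    "opacity": lambda o: min(1.0, o),
}

background = [Gaussian([1.0, 2.0, 3.0], 0.5), Gaussian([0.0, 0.0, 1.0], 1.5)]
avatar = [Gaussian([0.5, 0.5, 0.5], 0.25)]

# Each set passes through the shared maps once, so the cost grows with
# len(background) + len(avatar) rather than with their product.
unified = map_to_unified_space(background, shared_maps) + map_to_unified_space(
    avatar, shared_maps
)
```

The point of the sketch is the call pattern: both Gaussian sets go through the same per-attribute functions and are then concatenated for rendering, so no background-avatar pair is ever compared directly.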

Paper Structure

This paper contains 20 sections, 8 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Learning from a limited monocular video of a human moving around the scene, existing approaches face several challenges: (a) Holistic 4D reconstruction methods yang2023deformable3dgs struggle to maintain the integrity of the avatar. (b) Separate-based methods kocabas2024hugs suffer from unexpected occlusions and floating artifacts in the regions where the background and the avatar interact. (c) In contrast, our Separate-then-Map (StM) strategy achieves more accurate and complete reconstruction by mapping different model representations to a unified space.
  • Figure 2: Overview of Separate-then-Map (StM): Given an input video sequence, we first initialize the point clouds for the scene and avatar Gaussians using COLMAP predictions and SMPL vertex points. The decoupled design consists of: (a) a 3D Gaussian Splatting (3DGS) model that represents the background scene; (b) a deformable 3D Gaussian avatar model, driven by linear blend skinning (LBS), that represents the foreground human, with parameters including position offset $\Delta \mu$, rotation $R$, scale $S$, spherical harmonics (SH) coefficients $C$, opacity $O$, and LBS weight $W$, all predicted from the position triplane feature $\mu$; (c) an information mapping process that projects the scene model and the avatar model into a consistent space. During training, the rendered images and depth maps are used to compute the loss against the ground truth images and monocular estimated depth maps.
  • Figure 3: Qualitative comparison of novel view synthesis between our StM and HUGS kocabas2024hugs, D3DGS yang2023deformable3dgs, Vid2Avatar guo2023vid2avatar, and NeuMan jiang2022neuman. The zoomed-in regions (red boxes) highlight the differences.
  • Figure 4: Qualitative evaluation for novel pose synthesis and novel scene composition. Our method generates high-quality results that maintain the fine details and integrity of the avatar and show reduced floating artifacts in the background scene.
  • Figure 5: GaussianAvatar cannot reconstruct the entire scene and also renders an inferior avatar.
  • ...and 3 more figures
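
The linear blend skinning (LBS) step named in the Figure 2 caption can be illustrated with a short sketch. This is the standard textbook formulation (each point is moved by a weighted sum of per-bone rigid transforms), written in plain Python for readability; the function names `blend_transforms` and `lbs_point` are hypothetical, and how the paper combines LBS with the predicted offset $\Delta \mu$ and weights $W$ is not specified here, so neither is modeled.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # 4x4 homogeneous transform


def blend_transforms(weights: Sequence[float], transforms: Sequence[Matrix]) -> Matrix:
    """Return the weighted sum of 4x4 bone transforms."""
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("LBS weights are expected to sum to 1")
    blended = [[0.0] * 4 for _ in range(4)]
    for w, transform in zip(weights, transforms):
        for i in range(4):
            for j in range(4):
                blended[i][j] += w * transform[i][j]
    return blended


def lbs_point(
    point: Sequence[float], weights: Sequence[float], transforms: Sequence[Matrix]
) -> List[float]:
    """Move a canonical-space 3D point with linear blend skinning."""
    blended = blend_transforms(weights, transforms)
    homogeneous = [point[0], point[1], point[2], 1.0]
    # Apply the blended transform and drop the homogeneous coordinate.
    return [sum(blended[i][j] * homogeneous[j] for j in range(4)) for i in range(3)]


# Two toy bones: one stays put, one translates by 2 along x.
identity = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]
shift_x = [[1.0, 0.0, 0.0, 2.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]

# A point influenced equally by both bones moves halfway: 1 along x.
posed = lbs_point([1.0, 1.0, 1.0], [0.5, 0.5], [identity, shift_x])
```

In an avatar model of the kind the caption describes, the same operation would be applied to every Gaussian center in a vectorized form, with the bone transforms coming from the SMPL pose.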