Dynamic Avatar-Scene Rendering from Human-centric Context
Wenqing Wang, Haosen Yang, Josef Kittler, Xiatian Zhu
TL;DR
This work tackles dynamic human-centric 4D reconstruction from monocular video by addressing the boundary artifacts that arise when scene and avatar are modeled separately. It introduces Separate-then-Map (StM), which uses a shared transformation per Gaussian attribute to map background Gaussians $\mathcal{G}_b$ and avatar Gaussians $\mathcal{G}_a$ into a unified space, enabling coherent rendering with a 3D Gaussian Splatting (3DGS) background and an SMPL-guided deformable avatar. The approach designs a mapping for each attribute, trains end to end with depth-aware and image-based losses, and reports state-of-the-art results on NeuMan, especially at human-scene boundaries, with only a small per-iteration overhead. The resulting photorealistic human-centric renderings are suited to AR/VR, VFX, and digital humans interacting with their environments.
Abstract
Reconstructing dynamic humans interacting with real-world environments from monocular videos is an important and challenging task. Despite considerable progress in 4D neural rendering, existing approaches either model dynamic scenes holistically or model humans and backgrounds separately in order to introduce parametric human priors. The former neglect the distinct motion characteristics of the scene's components, especially the human, leading to incomplete reconstructions; the latter ignore the information exchange between the separately modeled components, resulting in spatial inconsistencies and visual artifacts at human-scene boundaries. To address this, we propose the Separate-then-Map (StM) strategy, which introduces a dedicated information mapping mechanism to bridge separately defined and optimized models. Our method employs a shared transformation function for each Gaussian attribute to unify the separately modeled components, which improves computational efficiency by avoiding exhaustive pairwise interactions while ensuring spatial and visual coherence between humans and their surroundings. Extensive experiments on monocular video datasets demonstrate that StM significantly outperforms existing state-of-the-art methods in both visual quality and rendering accuracy, particularly at challenging human-scene interaction boundaries.
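The idea of one shared transformation per Gaussian attribute can be illustrated with a small sketch. This is not the authors' implementation: the affine form of the map, the attribute list and dimensions, and all names (`AffineMap`, `near_identity_map`, `map_to_unified_space`) are assumptions made purely for illustration, and the paper's actual mapping functions may differ. The sketch only shows the structural point that the same map is applied to background and avatar Gaussians alike, so the work grows with the total number of Gaussians rather than with the number of background-avatar pairs.

```python
import random
from dataclasses import dataclass

# Assumed attribute layout of a 3D Gaussian (name -> dimensionality).
ATTRIBUTES = {"position": 3, "rotation": 4, "scale": 3, "opacity": 1, "color": 3}


@dataclass
class AffineMap:
    """Hypothetical shared map for one attribute: x -> W x + b."""
    weight: list  # dim x dim matrix as nested lists
    bias: list    # vector of length dim

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]


def near_identity_map(dim, rng, noise=0.01):
    """Start close to the identity so the mapping initially changes little."""
    weight = [[(1.0 if i == j else 0.0) + rng.uniform(-noise, noise)
               for j in range(dim)] for i in range(dim)]
    return AffineMap(weight, [0.0] * dim)


def random_gaussians(n, rng):
    """Placeholder Gaussians with random attribute values."""
    return [{name: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
             for name, dim in ATTRIBUTES.items()} for _ in range(n)]


def map_to_unified_space(background, avatar, shared_maps):
    """Apply the same per-attribute map to every Gaussian of both sets.

    One pass over len(background) + len(avatar) Gaussians; no
    background-avatar pairs are ever enumerated.
    """
    return [{name: shared_maps[name](value) for name, value in g.items()}
            for g in background + avatar]


rng = random.Random(0)
shared_maps = {name: near_identity_map(dim, rng) for name, dim in ATTRIBUTES.items()}
background = random_gaussians(5, rng)  # stands in for G_b
avatar = random_gaussians(3, rng)      # stands in for G_a
unified = map_to_unified_space(background, avatar, shared_maps)
print(len(unified), "Gaussians mapped with", len(shared_maps), "shared maps")
```

In a trainable version, the parameters of each shared map would be optimized jointly with the Gaussians; the sketch leaves them fixed.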
