AHA! Animating Human Avatars in Diverse Scenes with Gaussian Splatting

Aymen Mir, Jian Wang, Riza Alp Guler, Chuan Guo, Gerard Pons-Moll, Bing Zhou

TL;DR

This work addresses the challenge of photorealistic human animation within 3D scenes by leveraging 3D Gaussian Splatting to represent both humans and environments. By decoupling rendering from motion synthesis, it introduces a Gaussian-aligned motion module and a differentiable contact refinement to produce geometry-consistent interactions without requiring paired data. The approach demonstrates strong rendering quality across diverse scenes and enables geometry-consistent free-viewpoint rendering of edited monocular videos, highlighting 3DGS as a viable backbone for scalable, video-native human–scene animation. The findings suggest significant practical impact for monocular video editing, gaming, and CGI, while pointing to future work in lighting, physics-based constraints, and multi-person scenarios.

Abstract

We present a novel framework for animating humans in 3D scenes using 3D Gaussian Splatting (3DGS), a neural scene representation that has recently achieved state-of-the-art photorealistic results for novel-view synthesis but remains under-explored for human-scene animation and interaction. Unlike existing animation pipelines that use meshes or point clouds as the underlying 3D representation, our approach introduces 3DGS as the 3D representation for the problem of animating humans in scenes. By representing humans and scenes as Gaussians, our approach allows for geometry-consistent free-viewpoint rendering of humans interacting with 3D scenes. Our key insight is that the rendering can be decoupled from the motion synthesis and each sub-problem can be addressed independently, without the need for paired human-scene data. Central to our method is a Gaussian-aligned motion module that synthesizes motion without explicit scene geometry, using opacity-based cues and projected Gaussian structures to guide human placement and pose alignment. To ensure natural interactions, we further propose a human-scene Gaussian refinement optimization that enforces realistic contact and navigation. We evaluate our approach on scenes from ScanNet++ and the SuperSplat library, and on avatars reconstructed from sparse and dense multi-view human capture. Finally, we demonstrate that our framework allows for novel applications such as geometry-consistent free-viewpoint rendering of edited monocular RGB videos with new animated humans, showcasing the unique advantage of 3DGS for monocular video-based human animation.

Paper Structure

This paper contains 10 sections, 4 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 2: Method. Top Left: Using multi-view images of the human, we reconstruct controllable human Gaussians. Bottom Left: Using a scene video, we reconstruct static scene Gaussians. Top Middle: We synthesize human motion that conforms to the 3D scene Gaussians using the Gaussian-aligned motion module. Top Right: The human poses are used to animate the human Gaussians. We further refine these Gaussians for correct placement and contact. Bottom Right: The composited human and scene Gaussians can be rendered from any viewpoint to generate photorealistic images of human-scene interaction.
  • Figure 3: Qualitative results. Our refinement yields consistent contacts across diverse scenes and identities (captures with 4 to 48 cameras).
  • Figure 4: Free-viewpoint rendering of edited monocular video with animated humans.
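
The compositing step mentioned in the Figure 2 caption (placing posed human Gaussians into a static scene and rendering the union) can be illustrated with a minimal sketch. This is not the authors' implementation: the `GaussianSet` container, the `rigid_transform` and `composite` helpers, and the choice to store full covariance matrices and plain RGB colors are all assumptions made for illustration. Typical 3DGS pipelines store per-Gaussian scales, rotation quaternions, and spherical-harmonic coefficients instead, and the paper's refinement optimization is not reproduced here.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class GaussianSet:
    """Hypothetical container for a set of 3D Gaussians (illustrative only)."""
    means: np.ndarray        # (N, 3) Gaussian centers
    covariances: np.ndarray  # (N, 3, 3) symmetric covariance matrices
    opacities: np.ndarray    # (N,) opacity values in [0, 1]
    colors: np.ndarray       # (N, 3) RGB colors (simplification of SH coefficients)


def rigid_transform(g: GaussianSet, R: np.ndarray, t: np.ndarray) -> GaussianSet:
    """Apply x -> R x + t to every Gaussian.

    Centers are rotated and translated; each covariance C becomes R C R^T.
    """
    means = g.means @ R.T + t
    covariances = R @ g.covariances @ R.T  # broadcasts over the N Gaussians
    return GaussianSet(means, covariances, g.opacities.copy(), g.colors.copy())


def composite(human: GaussianSet, scene: GaussianSet) -> GaussianSet:
    """Merge two Gaussian sets by concatenating their per-Gaussian parameters."""
    return GaussianSet(
        means=np.concatenate([human.means, scene.means], axis=0),
        covariances=np.concatenate([human.covariances, scene.covariances], axis=0),
        opacities=np.concatenate([human.opacities, scene.opacities], axis=0),
        colors=np.concatenate([human.colors, scene.colors], axis=0),
    )
```

Under these assumptions, a posed avatar would first be moved into the scene frame with `rigid_transform` and then merged with `composite`; the merged set could be handed to any Gaussian rasterizer, since both inputs share one representation.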