
Rendering Multi-Human and Multi-Object with 3D Gaussian Splatting

Weiquan Wang, Jun Xiao, Feifei Shao, Yi Yang, Yueting Zhuang, Long Chen

Abstract

Reconstructing dynamic scenes with multiple interacting humans and objects from sparse-view inputs is a critical yet challenging task, essential for creating high-fidelity digital twins for robotics and VR/AR. This problem, which we term Multi-Human Multi-Object (MHMO) rendering, presents two significant obstacles: achieving view-consistent representations for individual instances under severe mutual occlusion, and explicitly modeling the complex and combinatorial dependencies that arise from their interactions. To overcome these challenges, we propose MM-GS, a novel hierarchical framework built upon 3D Gaussian Splatting. Our method first employs a Per-Instance Multi-View Fusion module to establish a robust and consistent representation for each instance by aggregating visual information across all available views. Subsequently, a Scene-Level Instance Interaction module operates on a global scene graph to reason about relationships between all participants, refining their attributes to capture subtle interaction effects. Extensive experiments on challenging datasets demonstrate that our method significantly outperforms strong baselines, producing state-of-the-art results with high-fidelity details and plausible inter-instance contacts.

Paper Structure

This paper contains 13 sections, 12 equations, 4 figures, 3 tables.

Figures (4)

  • Figure A1: Core challenges in Multi-Human Multi-Object (MHMO) rendering. From sparse views, rendering complex interactions involves overcoming two key challenges: ensuring cross-view consistency under severe occlusion (top) and modeling the mutual influence between instances at contact regions (bottom). Our MM-GS is designed to address both.
  • Figure B1: Overview of MM-GS pipeline. Our method refines initial 3D Gaussian representations through three main stages. (a) Human-Object Deformation: We initialize the scene by deforming canonical human and object models to their target poses and representing them as collections of 3D Gaussians. (b) Per-Instance Multi-View Fusion: A Cross-View Fusion network refines each instance's appearance and local geometry by aggregating visual features from all its visible viewpoints, ensuring a view-consistent representation. (c) Scene-Level Instance Interaction: Finally, an Instance Interaction network operates on a global scene graph to model the dependencies between all participants, enabling a final refinement to capture interaction-driven effects.
  • Figure E1: Qualitative comparison on the HOI-M$^3$ dataset. We highlight specific regions with colored dashed circles to illustrate the differences. Note that our MM-GS generates significantly sharper details and more plausible contact regions. In contrast, the NeRF-based NeuralHOIFVV-MM tends to produce overly smooth or blurry results, while the 3DGS-based GTU-MM suffers from floating artifacts and geometric inconsistencies.
  • Figure E2: Qualitative results of our ablation study. Removing both modules (w/o Both) leads to blurry results. Adding the View Fusion module (+ View Fusion) significantly improves sharpness. Further incorporating the Interaction network (+ View Fusion + Interaction) resolves ambiguities at contact regions, resulting in cleaner boundaries.
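The three-stage pipeline described above (deform instances into 3D Gaussians, fuse per-instance features across visible views, then refine via message passing on a scene graph) can be sketched in miniature. This is a hypothetical illustration, not the authors' implementation: the visibility-weighted mean stands in for the learned Cross-View Fusion network, and a single residual message-passing round stands in for the Instance Interaction network.

```python
import numpy as np

def per_instance_fusion(view_feats, vis_mask):
    """Aggregate each instance's features over its visible views.

    view_feats: (num_instances, num_views, feat_dim)
    vis_mask:   (num_instances, num_views), 1 where the instance is visible
    A visibility-weighted mean; the paper uses a learned fusion network.
    """
    w = vis_mask[:, :, None].astype(float)
    return (view_feats * w).sum(axis=1) / np.clip(w.sum(axis=1), 1e-6, None)

def instance_interaction(inst_feats, adjacency):
    """One round of mean message passing over the scene graph.

    adjacency: (num_instances, num_instances), 1 where two instances interact
    (e.g., a human touching an object). Residual update mimics refinement.
    """
    deg = np.clip(adjacency.sum(axis=1, keepdims=True), 1.0, None)
    messages = (adjacency @ inst_feats) / deg
    return inst_feats + messages

# Toy scene: 3 instances (2 humans, 1 object), 4 camera views, 8-dim features.
rng = np.random.default_rng(0)
view_feats = rng.normal(size=(3, 4, 8))
vis = np.array([[1, 1, 0, 1],
                [1, 0, 1, 1],
                [0, 1, 1, 1]])          # occlusion pattern per view
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], float)       # both humans interact with the object

fused = per_instance_fusion(view_feats, vis)      # view-consistent per-instance features
refined = instance_interaction(fused, adj)        # interaction-aware refinement
print(fused.shape, refined.shape)                 # (3, 8) (3, 8)
```

In the actual method, these refined features would drive updates to each instance's 3D Gaussian attributes (position, opacity, color) before splatting; the sketch only shows the hierarchical fuse-then-interact data flow.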