HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video

Hongchi Xia, Chih-Hao Lin, Hao-Yu Hsu, Quentin Leboutet, Katelyn Gao, Michael Paulitsch, Benjamin Ummenhofer, Shenlong Wang

TL;DR

HoloScene reconstructs simulation-ready, interactive 3D environments from a single video by unifying geometry, appearance, and physical properties in an interactive scene graph. It formulates scene-graph recovery as a structured energy minimization and solves it with three-stage inference: gradient-based initialization, generative sampling with tree search for amodal completion, and a final gradient-based refinement. The method yields complete geometry, physically plausible dynamics, and photorealistic rendering, and it outperforms state-of-the-art baselines in geometry, physical stability, and object-level reconstruction across multiple indoor datasets. Demonstrated applications include real-time gaming, 3D editing, and immersive experiences, which points to the potential of simulation-ready digital twins for robotics, AR/VR, and visual effects. The method currently targets static indoor scenes; future work aims at relightable, articulated, and deformable scene components and at dynamic outdoor scenes.

Abstract

Digitizing the physical world into accurate simulation-ready virtual environments offers significant opportunities in a variety of fields such as augmented and virtual reality, gaming, and robotics. However, current 3D reconstruction and scene-understanding methods commonly fall short in one or more critical aspects, such as geometry completeness, object interactivity, physical plausibility, photorealistic rendering, or realistic physical properties for reliable dynamic simulation. To address these limitations, we introduce HoloScene, a novel interactive 3D reconstruction framework that simultaneously achieves these requirements. HoloScene leverages a comprehensive interactive scene-graph representation, encoding object geometry, appearance, and physical properties alongside hierarchical and inter-object relationships. Reconstruction is formulated as an energy-based optimization problem, integrating observational data, physical constraints, and generative priors into a unified, coherent objective. Optimization is efficiently performed via a hybrid approach combining sampling-based exploration with gradient-based refinement. The resulting digital twins exhibit complete and precise geometry, physical stability, and realistic rendering from novel viewpoints. Evaluations conducted on multiple benchmark datasets demonstrate superior performance, while practical use-cases in interactive gaming and real-time digital-twin manipulation illustrate HoloScene's broad applicability and effectiveness. Project page: https://xiahongchi.github.io/HoloScene.
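
The hybrid optimization described in the abstract (sampling-based exploration followed by gradient-based refinement) can be illustrated with a minimal sketch. This is not the authors' implementation: the one-dimensional double-well `energy`, the uniform proposal sampler, and every function name below are assumptions chosen only to show why sampling helps when an energy has more than one basin.

```python
import random


def energy(x):
    """Toy double-well energy: a stand-in for the scene energy, not the paper's objective."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def energy_grad(x):
    """Analytic derivative of the toy energy."""
    return 4.0 * x * (x * x - 1.0) + 0.3


def refine(x, steps=500, lr=0.01):
    """Gradient-based refinement: plain gradient descent from a starting point."""
    for _ in range(steps):
        x -= lr * energy_grad(x)
    return x


def hybrid_minimize(proposals=None, num_samples=16, seed=0):
    """Refine every sampled proposal and keep the lowest-energy result."""
    if proposals is None:
        rng = random.Random(seed)
        # Uniform draws stand in for a learned generative sampler (assumption).
        proposals = [rng.uniform(-2.0, 2.0) for _ in range(num_samples)]
    refined = [refine(x) for x in proposals]
    return min(refined, key=energy)


x_local = refine(0.5)
x_hybrid = hybrid_minimize(proposals=[-1.5, -0.2, 0.5, 1.4])
print(f"gradient only: x={x_local:.3f}, E={energy(x_local):.3f}")
print(f"hybrid:        x={x_hybrid:.3f}, E={energy(x_hybrid):.3f}")
```

In this toy setting, gradient descent started at 0.5 settles in the shallower well, while refining all proposals and keeping the best one reaches the deeper well.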

Paper Structure

This paper contains 40 sections, 5 equations, 8 figures, 14 tables.

Figures (8)

  • Figure 1: Overview of HoloScene: From a single input video—along with visual cues such as segmentation and monocular depth—HoloScene reconstructs a simulation‑ready, interactive 3D digital twin represented as a scene graph with complete geometry, physically plausible dynamics, and realistic rendering. The resulting model enables a variety of downstream applications, including real‑time interactive gaming, 3D editing, immersive experience capture, and dynamic visual effects.
  • Figure 2: Overview of HoloScene Optimization Stages: Given multiple posed images and visual cues (instance masks, monocular geometry priors), we first run a gradient-based optimization as initialization. We then apply generative sampling and tree search along the topology of the scene graph to obtain complete, physically plausible geometry. Finally, fine-tuning over the whole scene further enhances the realism of the reconstruction.
  • Figure 3: Qualitative Comparisons on Object Geometry and Appearance Reconstruction: Our method delivers superior reconstructions by smoothly inpainting occluded regions with LaMa and completing invisible back-facing geometry with Wonder3D. Unlike baselines, our approach eliminates object interpenetration, ensuring physical stability during simulation.
  • Figure 4: Qualitative Comparisons on Physical Simulation: We compare geometry layouts and appearance before and after physical simulation, with the table geometry reconstructions highlighted in inset figures. HoloScene's complete, non-interpenetrating geometry remains stable in physics simulators, unlike baseline methods. Our Gaussian-on-mesh representation delivers high-quality, real-time rendering throughout the simulation process.
  • Figure 5: Dynamic VFX Results: We augment the inferred interactive 3D scene with visual effects such as dropped objects, added animations, and fire.
  • ...and 3 more figures
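
The tree search "along the topology of the scene graph" mentioned in the Figure 2 caption can be sketched as a beam search that visits supporting objects before the objects resting on them. This is only an illustration under stated assumptions: the `Candidate` and `Node` classes, the height-based energy, and the table-and-cup example are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    label: str
    height: float      # hypothetical summary of a completed shape (meters)
    data_cost: float   # hypothetical mismatch with the image evidence


@dataclass
class Node:
    name: str
    observed_top: float  # hypothetical observed height of the object's top (meters)
    candidates: list
    children: list = field(default_factory=list)


def topological_order(root):
    """Visit supporting objects before the objects resting on them."""
    order, queue = [], [root]
    while queue:
        node = queue.pop(0)
        order.append(node)
        queue.extend(node.children)
    return order


def beam_search(root, beam_width=2):
    """Pick one candidate per node, keeping the beam_width cheapest partial assignments."""
    order = topological_order(root)
    parent_of = {child.name: node.name for node in order for child in node.children}
    beams = [(0.0, {})]  # (accumulated energy, {node name: (candidate, top height)})
    for node in order:
        expanded = []
        for cost, chosen in beams:
            # A child rests on its parent's top; the root rests on the floor at 0.
            base = chosen[parent_of[node.name]][1] if node.name in parent_of else 0.0
            for cand in node.candidates:
                top = base + cand.height
                step = cand.data_cost + (top - node.observed_top) ** 2
                expanded.append((cost + step, {**chosen, node.name: (cand, top)}))
        expanded.sort(key=lambda state: state[0])
        beams = expanded[:beam_width]
    return beams[0]


cup = Node("cup", 0.85, [Candidate("mug", 0.10, 0.0), Candidate("glass", 0.20, 0.0)])
table = Node("table", 0.75, [Candidate("A", 0.70, 0.0), Candidate("B", 0.75, 0.003)], [cup])

for width in (1, 2):
    cost, chosen = beam_search(table, beam_width=width)
    labels = {name: cand.label for name, (cand, _) in chosen.items()}
    print(width, round(cost, 4), labels)
```

With a beam width of 1 the search commits early to table candidate A and ends at a higher total energy. A width of 2 keeps candidate B alive long enough for the cup's evidence to favor it, which is the motivation for searching over the tree rather than choosing greedily per object.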