Real-to-Sim for Highly Cluttered Environments via Physics-Consistent Inter-Object Reasoning

Tianyi Xiang, Jiahang Cao, Sikai Guo, Guoyang Zhao, Andrew F. Luo, Jun Ma

TL;DR

Addressing the perception–control gap in cluttered robotic environments, the paper aims to reconstruct physically valid digital twins from a single RGB-D observation $I_t$ and masks $M_t$. It introduces a physics-constrained Real2Sim pipeline that builds an explicit contact graph and uses a two-stage optimization with a differentiable rigid-body simulator (DiffSDFSim) to jointly refine object poses and physical parameters, culminating in photometric refinement with a differentiable renderer. The approach enforces SDF-based contact constraints, hierarchical physics constraints along a parse tree, and long-horizon zero-velocity priors to ensure stable dynamics under gravity. Across both simulated and real-world pushing tasks, the method achieves high physical stability and realistic contact dynamics while preserving competitive geometric and rendering quality, enabling safer planning and policy learning in cluttered environments.
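
The zero-velocity prior can be illustrated with a toy example. The sketch below is not the paper's DiffSDFSim and computes no gradients. It assumes a single block released at rest on an incline, a Coulomb friction model, and a mean-squared-speed loss; the function names `rollout_speeds` and `zero_velocity_loss` are hypothetical. It only shows why a physical parameter such as friction has to be estimated alongside pose: the same placement is stable for one friction value and slides for another.

```python
import math

def rollout_speeds(mu, angle_deg, steps=200, dt=0.01, g=9.81):
    """Speeds of a block released at rest on an incline (toy 1-D model, assumed)."""
    theta = math.radians(angle_deg)
    drive = g * math.sin(theta)               # gravity component along the slope
    max_friction = mu * g * math.cos(theta)   # Coulomb friction limit
    v, speeds = 0.0, []
    for _ in range(steps):
        # Friction cancels gravity up to its limit; only the excess accelerates the block.
        accel = max(0.0, drive - max_friction)
        v += accel * dt
        speeds.append(v)
    return speeds

def zero_velocity_loss(speeds):
    """Mean squared speed over the horizon (assumed loss form); zero only if nothing moves."""
    return sum(s * s for s in speeds) / len(speeds)

stable = zero_velocity_loss(rollout_speeds(mu=0.8, angle_deg=20))
sliding = zero_velocity_loss(rollout_speeds(mu=0.1, angle_deg=20))
print(stable, sliding > 0.0)
```

In a differentiable simulator the same kind of loss could be back-propagated to the friction coefficient and the initial pose, which is the role the TL;DR attributes to the physics-constrained stage.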

Abstract

Reconstructing physically valid 3D scenes from single-view observations is a prerequisite for bridging the gap between visual perception and robotic control. However, in scenarios requiring precise contact reasoning, such as robotic manipulation in highly cluttered environments, geometric fidelity alone is insufficient. Standard perception pipelines often neglect physical constraints, resulting in invalid states, e.g., floating objects or severe inter-penetration, rendering downstream simulation unreliable. To address these limitations, we propose a novel physics-constrained Real-to-Sim pipeline that reconstructs physically consistent 3D scenes from single-view RGB-D data. Central to our approach is a differentiable optimization pipeline that explicitly models spatial dependencies via a contact graph, jointly refining object poses and physical properties through differentiable rigid-body simulation. Extensive evaluations in both simulation and real-world settings demonstrate that our reconstructed scenes achieve high physical fidelity and faithfully replicate real-world contact dynamics, enabling stable and reliable contact-rich manipulation.

Paper Structure

This paper contains 18 sections, 7 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Scene-level Real2Sim methods for physical stability. Given a single RGB-D observation and instance masks, we reconstruct the scene and simulate it in PyBullet [coumans2016pybullet]. (a) SAM3D [chen2025sam] with Iterative Closest Point (ICP) refinement, without geometric or physical constraints, produces interpenetrating and floating objects, leading to unstable rollouts. (b) The geometry-only constrained method avoids penetration and ensures minimum contact, but does not guarantee long-horizon stability. (c) Our physics-constrained method jointly optimizes pose and physical parameters (e.g., friction, mass, and center of mass) so that the resulting simulation remains physically stable over time.
  • Figure 2: Overview of our method. Our physics-constrained Real2Sim pipeline consists of four stages. (a) Initial Reconstruction: Given a single RGB-D image $I_t$ and instance masks $M_t$, we obtain an initial estimate of object geometry and appearance $\theta$ using SAM3D [chen2025sam] and ICP pose refinement. (b) Contact Graph Construction: We construct a contact graph $cg = (pt, E)$, where the parse tree $pt$ represents the support tree and the edges $E$ encode proximal relationships between objects. (c) Two-Stage Physics-Constrained Optimization: Guided by the contact graph, we optimize object properties in two stages. First, a geometry-aware optimization introduces SDF-based contact constraints and visual regularization to globally refine object poses, producing a penetration-free and contact-consistent initialization. Second, a hierarchical physics-constrained optimization, following the order of the parse tree, uses differentiable simulation to jointly refine the initial pose and physical parameters of each object for long-horizon physical stability. (d) Photometric Refinement: As a final post-processing step, object textures are refined using a differentiable renderer to achieve photometric consistency.
  • Figure 3: Qualitative comparisons of physical simulation results with state-of-the-art scene-level reconstruction methods in the simulation environment. We visualize geometry and appearance before and after physical simulation with gravity in PyBullet [coumans2016pybullet]. Our method produces non-interpenetrating, contact-coherent geometry and achieves long-horizon physical stability, in contrast to the baseline methods.
  • Figure 4: Real-world Real2Sim experiment with robot pushing interaction replay. We record the pushing trajectory of a Franka arm in the real world and replay it in the reconstructed digital twin. Using a single-view observation, our method produces a physically consistent scene, and its predicted post-interaction state matches the real outcome more closely than that of SAM3D+ICP.
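
The contact graph and the SDF-based penetration constraint described in the Figure 2 caption can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: it assumes every object is a sphere (so the signed distance is analytic), a ground plane at z = 0, and hand-picked tolerances. The names `gap`, `build_contact_graph`, `support_order`, and `penetration_loss` are hypothetical.

```python
import math
from itertools import combinations

GROUND = "ground"

def gap(a, b):
    """Signed distance between two sphere surfaces (negative means penetration)."""
    return math.dist(a["c"], b["c"]) - a["r"] - b["r"]

def build_contact_graph(objs, contact_tol=1e-3, near_tol=0.05):
    """Return (parent map of the support tree, set of proximity edges)."""
    parent, edges = {}, set()
    for name, o in objs.items():
        if o["c"][2] - o["r"] <= contact_tol:    # resting on the plane z = 0
            parent[name] = GROUND
    for (na, a), (nb, b) in combinations(objs.items(), 2):
        d = gap(a, b)
        if d > near_tol:
            continue                             # too far apart to interact
        lo, hi = (na, nb) if a["c"][2] <= b["c"][2] else (nb, na)
        if d <= contact_tol and hi not in parent:
            parent[hi] = lo                      # the lower object supports the higher one
        else:
            edges.add((lo, hi))                  # nearby, but not a support relation
    return parent, edges

def support_order(parent):
    """Supported objects sorted bottom-up by depth in the support tree."""
    def depth(n):
        return 0 if n not in parent else 1 + depth(parent[n])
    return sorted(parent, key=depth)

def penetration_loss(objs):
    """Sum of squared penetration depths over all pairs (assumed penalty form)."""
    return sum(max(0.0, -gap(a, b)) ** 2 for a, b in combinations(objs.values(), 2))

scene = {
    "A": {"c": (0.0, 0.0, 0.5), "r": 0.5},    # on the ground
    "B": {"c": (0.0, 0.0, 1.5), "r": 0.5},    # stacked on A
    "C": {"c": (1.02, 0.0, 0.5), "r": 0.5},   # on the ground, close to A
}
parent, edges = build_contact_graph(scene)
print(parent, edges, support_order(parent), penetration_loss(scene))
```

Processing objects in `support_order` mirrors the idea of optimizing supporters before the objects that rest on them. A real pipeline would query mesh SDFs rather than sphere radii.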