Real-to-Sim for Highly Cluttered Environments via Physics-Consistent Inter-Object Reasoning
Tianyi Xiang, Jiahang Cao, Sikai Guo, Guoyang Zhao, Andrew F. Luo, Jun Ma
TL;DR
To narrow the perception-control gap in cluttered robotic environments, the paper reconstructs physically valid digital twins from a single RGB-D observation $I_t$ and object masks $M_t$. It introduces a physics-constrained Real2Sim pipeline that builds an explicit contact graph and runs a two-stage optimization with a differentiable rigid-body simulator (DiffSDFSim) to jointly refine object poses and physical parameters, followed by photometric refinement with a differentiable renderer. The approach enforces SDF-based contact constraints, hierarchical physics constraints along a parse tree, and long-horizon zero-velocity priors so that the reconstructed scene remains stable under gravity. In both simulated and real-world pushing tasks, the method achieves high physical stability and realistic contact dynamics while keeping geometric and rendering quality competitive, which supports safer planning and policy learning in cluttered environments.
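To make the contact-graph idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes each object is approximated by a sphere proxy (center and radius) instead of a full SDF, treats the support plane $z = 0$ as a hypothetical node `GROUND`, and uses an illustrative tolerance `tol`; the function names are also assumptions. An edge is added whenever the signed gap between two proxies is at most `tol`, and a breadth-first traversal from the ground gives one possible support ordering, loosely analogous to walking a parse tree from supporting to supported objects.

```python
from collections import deque

import numpy as np

GROUND = -1  # hypothetical node id for the support plane z = 0


def sphere_gap(c_i, r_i, c_j, r_j):
    """Signed gap between two spheres; a negative value means penetration."""
    diff = np.asarray(c_i, dtype=float) - np.asarray(c_j, dtype=float)
    return float(np.linalg.norm(diff) - r_i - r_j)


def build_contact_graph(centers, radii, tol=1e-3):
    """Undirected contact graph as an adjacency dict over object ids and GROUND."""
    n = len(radii)
    graph = {GROUND: set()}
    for i in range(n):
        graph[i] = set()
    for i in range(n):
        # Object touches (or penetrates) the ground plane.
        if centers[i][2] - radii[i] <= tol:
            graph[GROUND].add(i)
            graph[i].add(GROUND)
        # Object touches (or penetrates) another object.
        for j in range(i + 1, n):
            if sphere_gap(centers[i], radii[i], centers[j], radii[j]) <= tol:
                graph[i].add(j)
                graph[j].add(i)
    return graph


def support_order(graph):
    """Breadth-first order from GROUND, plus objects with no contact path to it."""
    order, seen, queue = [], {GROUND}, deque([GROUND])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(graph[node]):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    floating = sorted(set(graph) - seen)
    return order, floating


# Two stacked spheres resting on the ground and one sphere hanging in the air.
centers = [[0.0, 0.0, 0.5], [0.0, 0.0, 1.5], [2.0, 0.0, 2.0]]
radii = [0.5, 0.5, 0.5]
graph = build_contact_graph(centers, radii)
order, floating = support_order(graph)
print("support order:", order, "floating:", floating)
```

In this toy scene the third sphere has no contact path to the ground, so it is reported as floating, which is the kind of physically invalid state the pipeline is meant to eliminate.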
Abstract
Reconstructing physically valid 3D scenes from single-view observations is a prerequisite for bridging the gap between visual perception and robotic control. However, in scenarios requiring precise contact reasoning, such as robotic manipulation in highly cluttered environments, geometric fidelity alone is insufficient. Standard perception pipelines often neglect physical constraints and produce invalid states, such as floating objects or severe inter-penetration, that make downstream simulation unreliable. To address these limitations, we propose a physics-constrained Real-to-Sim pipeline that reconstructs physically consistent 3D scenes from single-view RGB-D data. Central to our approach is a differentiable optimization pipeline that explicitly models spatial dependencies via a contact graph and jointly refines object poses and physical properties through differentiable rigid-body simulation. Extensive evaluations in both simulation and real-world settings demonstrate that our reconstructed scenes achieve high physical fidelity and faithfully replicate real-world contact dynamics, enabling stable and reliable contact-rich manipulation.
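The two failure modes named above, floating objects and inter-penetration, can both be read as nonzero signed gaps at contacts that should be exactly touching. The sketch below illustrates this under strong simplifying assumptions: objects are spheres in a vertical stack, only their heights are optimized, and a plain gradient descent on squared contact gaps stands in for the differentiable rigid-body simulation used in the paper. The function names, learning rate, and step count are illustrative choices, not values from the source.

```python
import numpy as np


def contact_residuals(z, radii):
    """Signed gaps in a vertical stack: ground to first sphere, then each pair.

    Positive gap = floating, negative gap = penetration, zero = touching.
    """
    z = np.asarray(z, dtype=float)
    radii = np.asarray(radii, dtype=float)
    ground_gap = z[0] - radii[0]
    pair_gaps = (z[1:] - z[:-1]) - (radii[1:] + radii[:-1])
    return np.concatenate([[ground_gap], pair_gaps])


def contact_loss(z, radii):
    """Sum of squared contact gaps; zero only when every contact just touches."""
    return float(np.sum(contact_residuals(z, radii) ** 2))


def refine_heights(z0, radii, lr=0.1, steps=500):
    """Gradient descent on contact_loss using its analytic gradient."""
    z = np.asarray(z0, dtype=float).copy()
    for _ in range(steps):
        g = contact_residuals(z, radii)
        grad = np.zeros_like(z)
        grad[0] += 2.0 * g[0]     # ground gap depends on z[0]
        grad[1:] += 2.0 * g[1:]   # pair gap k increases with z[k]
        grad[:-1] -= 2.0 * g[1:]  # pair gap k decreases with z[k-1]
        z -= lr * grad
    return z


# Initial estimate: first sphere sinks into the ground, second floats,
# third penetrates the second.
radii = [0.5, 0.5, 0.5]
z_init = [0.4, 1.7, 2.3]
z_refined = refine_heights(z_init, radii)
print("loss before:", contact_loss(z_init, radii))
print("loss after: ", contact_loss(z_refined, radii))
print("refined heights:", np.round(z_refined, 4))
```

A static contact residual like this only captures geometric consistency; it says nothing about mass, friction, or stability under gravity, which is why the paper optimizes through a simulator rather than through a purely geometric loss.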
