Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization

Wei-Cheng Huang, Jiaheng Han, Xiaohan Ye, Zherong Pan, Kris Hauser

TL;DR

This work tackles real-to-sim scene reconstruction in clutter by jointly estimating object shapes and poses under physics constraints. It introduces a structure-aware optimization framework built on the shape-differentiable contact model (SDRS), together with a linear solver that exploits the sparsity of the augmented-Lagrangian Hessian to handle large, contact-rich problems efficiently. The method initializes from learning-based priors (SAM3D/FoundationPose), decomposes each shape into convex hulls, and enforces quasistatic equilibrium with frictional contacts, producing simulation-ready configurations that are validated in MuJoCo. The end-to-end pipeline is robust to clutter, yields physically valid reconstructions, and gains computational speedups from the specialized solver, enabling practical real-to-sim transfer for planning and policy learning.

Abstract

Estimating simulation-ready scenes from real-world observations is crucial for downstream planning and policy learning tasks. Unfortunately, existing methods struggle in cluttered environments, often exhibiting prohibitive computational cost, poor robustness, and restricted generality when scaling to multiple interacting objects. We propose a unified optimization-based formulation for real-to-sim scene estimation that jointly recovers the shapes and poses of multiple rigid objects under physical constraints. Our method is built on two key technical innovations. First, we leverage the recently introduced shape-differentiable contact model, whose global differentiability permits joint optimization over object geometry and pose while modeling inter-object contacts. Second, we exploit the structured sparsity of the augmented Lagrangian Hessian to derive an efficient linear system solver whose computational cost scales favorably with scene complexity. Building on this formulation, we develop an end-to-end real-to-sim scene estimation pipeline that integrates learning-based object initialization, physics-constrained joint shape-pose optimization, and differentiable texture refinement. Experiments on cluttered scenes with up to 5 objects and 22 convex hulls demonstrate that our approach robustly reconstructs physically valid, simulation-ready object shapes and poses.

Paper Structure

This paper contains 28 sections, 25 equations, 13 figures, 5 tables, and 3 algorithms.

Figures (13)

  • Figure 1: Given a single RGBD image of a cluttered scene, we use SAM3D and FoundationPose to derive initial estimates of object shapes and poses. However, these estimates can violate physical constraints and are not simulation-ready (red). Our method jointly adjusts shape and pose parameters to enforce physics constraints while minimizing a perceptual loss, leading to simulation-ready results (green).
  • Figure 2: We illustrate the three types of objectives in eq:objective, which regularize the distances between the convex hull vertex $X_{ijk}$ (red), the SAM3D-identified mesh vertex $p_{il}\in\mathcal{M}_i$ (blue), and the point cloud point $p_{il}\in\mathcal{P}_i$ (yellow). We also highlight a case (a) where the objective value can increase. Suppose our rigid body (light brown) consists of two disjoint convex hulls (b); the closest point to the blue vertex then lies on the bottom surface of the top hull. After an update to the hull vertices (c), the two hulls merge and the closest point moves to the right boundary.
  • Figure 3: Suppose we would like to model a box (light brown) placed on a chair (dark brown). The box and the chair are the $i$th and $i'$th rigid bodies, respectively; the box is modeled as a single convex hull and the chair as the union of 4 convex hulls. Each convex hull is a polytope spanned by a set of vertices $X_{ijk}$ (red). Between the $ij$th convex hull on the box and the $i'j'$th convex hull on the chair, which models the back support, we introduce a separating plane $\left({n},{d}\right)_{iji'j'}$ (blue) as a proxy for the contact model.
  • Figure 4: An illustration of our friction model between the $ij$th and $i'j'$th convex hulls. At each vertex, e.g. $X_{ijk}$, the normal force $f_{ijk,i'j'}^\perp$ is a function of $x$ and $q$ (eq:perp), while the friction force $f_{ijk,i'j'}^\parallel$ is an additional decision variable to be optimized. Following the SDRS contact model, we use the separating plane as a proxy for contact: each force $f_{ijk,i'j'}^\parallel$ applied to the $ij$th convex hull is counteracted by $-f_{ijk,i'j'}^\parallel$ applied to the separating plane $\left({n},{d}\right)_{iji'j'}$, and the same holds for the $i'j'$th convex hull. We then model the separating plane as a massless physical object, so that all forces applied to it must balance.
  • Figure 5: We illustrate the structure of matrix $H$ (left) and matrix $A$ (right). $A$ is block-diagonal and each block is small. $H$ is factored using the Woodbury matrix identity.
  • ...and 8 more figures
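
The Figure 5 caption says that $A$ is block-diagonal with small blocks and that $H$ is factored using the Woodbury matrix identity. The sketch below shows how such a solve can work in general. It is not the paper's solver: the splitting $H = A + UU^T$ with a thin matrix $U$ is an assumption made here for illustration, and the helper names (solve_block_diag, woodbury_solve, demo) are hypothetical. Under that assumption, $H^{-1}b = A^{-1}b - A^{-1}U\,(I + U^T A^{-1} U)^{-1}\,U^T A^{-1} b$, so only the small blocks of $A$ and one $k \times k$ system need to be solved.

```python
import numpy as np


def solve_block_diag(blocks, rhs):
    """Solve A y = rhs, where A is block-diagonal and given as a list of small square blocks."""
    rhs = np.asarray(rhs, dtype=float)
    out = np.empty_like(rhs)
    start = 0
    for block in blocks:
        n = block.shape[0]
        # Each block is solved independently; rhs may be a vector or a matrix.
        out[start:start + n] = np.linalg.solve(block, rhs[start:start + n])
        start += n
    return out


def woodbury_solve(blocks, U, b):
    """Solve (A + U U^T) x = b via the Woodbury identity (assumed splitting, see lead-in)."""
    Ainv_b = solve_block_diag(blocks, b)
    Ainv_U = solve_block_diag(blocks, U)
    k = U.shape[1]
    # Small k-by-k "capacitance" matrix I + U^T A^{-1} U.
    S = np.eye(k) + U.T @ Ainv_U
    return Ainv_b - Ainv_U @ np.linalg.solve(S, U.T @ Ainv_b)


def demo(seed=0):
    """Compare the Woodbury solve against a dense solve on random data; return the max error."""
    rng = np.random.default_rng(seed)
    sizes = [3, 2, 4]
    blocks = []
    for n in sizes:
        M = rng.standard_normal((n, n))
        blocks.append(M @ M.T + n * np.eye(n))  # symmetric positive definite block
    total = sum(sizes)
    U = rng.standard_normal((total, 2))
    b = rng.standard_normal(total)

    # Dense reference: assemble A explicitly and solve with H = A + U U^T.
    A = np.zeros((total, total))
    start = 0
    for block in blocks:
        n = block.shape[0]
        A[start:start + n, start:start + n] = block
        start += n
    x_dense = np.linalg.solve(A + U @ U.T, b)

    x_woodbury = woodbury_solve(blocks, U, b)
    return float(np.max(np.abs(x_woodbury - x_dense)))


if __name__ == "__main__":
    print("max abs difference vs. dense solve:", demo())
```

The appeal of this pattern is that the cost of the block solves grows linearly with the number of blocks, while the dense part is confined to a $k \times k$ system, which matches the abstract's claim that the solver cost scales favorably with scene complexity.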