Differentiable Inverse Graphics for Zero-shot Scene Reconstruction and Robot Grasping

Octavio Arriaga, Proneet Sharma, Jichen Guo, Marc Otto, Siddhant Kadwe, Rebecca Adam

TL;DR

The paper introduces Differentiable Neuro-Graphics, a data-efficient pipeline that reconstructs unseen scenes and enables zero-shot robot grasping from a single RGBD image. It integrates foundation-model segmentation with a physics-based differentiable renderer, and solves constrained optimization problems to infer meshes, lighting, materials, and 6D poses without 3D training data. The approach demonstrates strong zero-shot pose estimation on model-free few-shot benchmarks and achieves a high success rate in zero-shot grasping, highlighting improved interpretability and data efficiency over traditional deep-learning pipelines. This work advances autonomous robot performance in novel environments by enabling physically consistent scene understanding without large labeled datasets or test-time data collection.

Abstract

Operating effectively in novel real-world environments requires robotic systems to estimate and interact with previously unseen objects. Current state-of-the-art models address this challenge by using large amounts of training data and test-time samples to build black-box scene representations. In this work, we introduce a differentiable neuro-graphics model that combines neural foundation models with physics-based differentiable rendering to perform zero-shot scene reconstruction and robot grasping without relying on any additional 3D data or test-time samples. Our model solves a series of constrained optimization problems to estimate physically consistent scene parameters, such as meshes, lighting conditions, material properties, and 6D poses of previously unseen objects from a single RGBD image and bounding boxes. We evaluated our approach on standard model-free few-shot benchmarks and demonstrated that it outperforms existing algorithms for model-free few-shot pose estimation. Furthermore, we validated the accuracy of our scene reconstructions by applying our algorithm to a zero-shot grasping task. By enabling zero-shot, physically consistent scene reconstruction and grasping without reliance on extensive datasets or test-time sampling, our approach offers a pathway towards more data-efficient, interpretable, and generalizable robot autonomy in novel environments.

Paper Structure

This paper contains 15 sections, 17 equations, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: (1) The system first observes the scene with an RGBD camera. (2) The RGB image is segmented using a foundation model and an object detector to obtain object masks, which are then combined with the observations to initialize scene geometry. (3) A physics-based differentiable renderer iteratively refines the system's internal world model, encompassing meshes, lights, and materials, by comparing rendered images to real observations. (4) The optimized 3D scene enables accurate grasping of novel objects.
  • Figure 2: Differentiable Neuro-Graphics for zero-shot scene reconstruction and robotic manipulation. Starting from an RGBD image and bounding box prompts, the system uses the segmentation model SAM to initialize the object masks. It then initializes a 3D scene by performing a robust probabilistic estimation of object shapes using ellipsoidal primitives. Subsequently, a mesh optimization stage refines the mesh vertices through a cage-based deformation model. The resulting scene representation includes meshes, poses, materials, masks, and lighting conditions. Finally, the reconstructed scene is used in simulation to find an optimal grasp, which is then performed in reality using the robotic system.
  • Figure 3: Line equality constraint, constraining the $\mathbb{R}^{3}$ search space to a single dimension $t$ along $d$.
  • Figure 4: Differentiable rendering pipeline. The renderer $\mathcal{R}$ takes scene parameters $\boldsymbol{\theta}_{k}$ (left) to output RGBD images (right). The error between rendered and real observations is differentiated to provide gradients for optimizing $\boldsymbol{\theta}_{k}$.
  • Figure 5: Zero-shot pose estimation results on the FewSOL, CLEVR-POSE, MOPED, and LINEMOD-OCCLUDED datasets. All objects are previously unseen by the model.
  • ...and 7 more figures
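
The line equality constraint described in the Figure 3 caption can be illustrated with a small sketch. It is not the paper's implementation: the function names, the choice of origin, the plain gradient-descent solver, and the squared-distance objective are all assumptions made for illustration. The idea shown is only that a 3D position is written as a point on a line, so the optimizer searches over a single scalar $t$ along the direction $d$ instead of over all of $\mathbb{R}^{3}$.

```python
import numpy as np


def point_on_line(origin, direction, t):
    """Return origin + t * d_hat, where d_hat is the normalized direction."""
    d = np.asarray(direction, dtype=float)
    d_hat = d / np.linalg.norm(d)
    return np.asarray(origin, dtype=float) + t * d_hat


def fit_t(origin, direction, target, steps=200, lr=0.1):
    """Minimize ||p(t) - target||^2 over the scalar t by gradient descent.

    The objective and solver are hypothetical stand-ins; the point is that
    the only free variable is t, not a full 3D translation.
    """
    d = np.asarray(direction, dtype=float)
    d_hat = d / np.linalg.norm(d)
    t = 0.0
    for _ in range(steps):
        residual = point_on_line(origin, direction, t) - target
        # Chain rule: d/dt ||p(t) - target||^2 = 2 * residual . d_hat
        grad = 2.0 * float(residual @ d_hat)
        t -= lr * grad
    return t


if __name__ == "__main__":
    origin = np.zeros(3)                   # assumed line origin
    direction = np.array([0.0, 0.0, 2.0])  # assumed line direction d
    target = np.array([0.3, -0.1, 1.5])    # off-line point to approach
    t = fit_t(origin, direction, target)
    print(t, point_on_line(origin, direction, t))
```

Because the position is parameterized by $t$ alone, the off-line components of the target cannot pull the estimate away from the line; the solver converges to the orthogonal projection of the target onto it.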