DEF-oriCORN: efficient 3D scene understanding for robust language-directed manipulation without demonstrations

Dongwon Son, Sanghyeon Son, Jaehyung Kim, Beomjoon Kim

TL;DR

DEF-oriCORN presents an object-centered representation $\mathbf{s}=(\mathbf{z},\mathbf{c},\mathbf{p})$ with a diffusion-based estimator to enable robust language-directed manipulation from sparse RGB views without demonstrations. It couples an SO(3)-equivariant encoder (FER-VN) with decoders for occupancy, collision, and ray-hitting to form oriCORN, which supports efficient collision checking and language grounding via CLIP. The estimator $f_{den}$ aggregates multi-view features and iteratively refines object states across diffusion timesteps, capturing multimodal uncertainty and enabling robust planning under uncertainty with RRT*-based motion planning and antipodal-grasp heuristics. Extensive simulations show higher estimation accuracy (AP, CD) and faster planning than baselines, while real-world experiments with transparent and shiny objects demonstrate zero-shot transfer, efficient language grounding, and a 75% success rate for language-guided tasks. The framework is designed for RGB-only inputs with zero demonstrations and is released with public data, code, and weights to support reproducibility and broader adoption.

Abstract

We present DEF-oriCORN, a framework for language-directed manipulation tasks. By leveraging a novel object-based scene representation and a diffusion-model-based state estimation algorithm, our framework enables efficient and robust manipulation planning in response to verbal commands, even in tightly packed environments with sparse camera views, without any demonstrations. Unlike traditional representations, our representation affords efficient collision checking and language grounding. Compared to state-of-the-art baselines, our framework achieves superior estimation and motion planning performance from sparse RGB images and generalizes zero-shot to real-world scenarios with diverse materials, including transparent and reflective objects, despite being trained exclusively in simulation. Our code for data generation, training, and inference, along with pre-trained weights, is publicly available at: https://sites.google.com/view/def-oricorn/home.

Paper Structure (27 sections, 4 equations, 17 figures, 6 tables, 3 algorithms)

Figures (17)

  • Figure 1: An example problem. We use three RGB cameras, one of which is attached to the end-effector, and the rest are fixed facing the cabinet. Multiple objects including shiny and transparent objects are on the shelf. The robot must ground the language command to objects and plan a collision-free pick-and-place motion to achieve the goal.
  • Figure 2: Three stages of our framework. (i) An architecture for training the oriented shape embedding $\mathbf{z}$. Given a mesh, we create its point cloud, $x$, and determine the location of the volumetric centroid, $\mathbf{c}\in\mathbb{R}^3$, and $M$ geometrically representative points $\mathbf{p}\in\mathbb{R}^{M \times 3}$. The orange box is the encoder $f_{enc}$, which encodes $x$ into $\mathbf{z}$, and the blue boxes are the decoders. The occupancy decoder $f_{occ}$ receives query points and $\mathbf{s}$, and outputs zero if the query points are inside the object. The ray-hitting decoder $f_{ray}$ receives $\mathbf{s}$, a ray starting point, and a ray direction, and outputs a binary value indicating whether the ray hits the object. The collision decoder $f_{col}$ predicts a binary value indicating whether a pair of object states is in collision. (ii) In the estimator learning stage, we fix $f_{enc}$ and train the estimation module that predicts $\mathbf{s}$ from a set of $V$ images with intrinsic and extrinsic camera parameters $\{I_i,\xi_i\}_{i=1}^{V}$. (iii) In the test phase, we use the estimated $\mathbf{s}$ with the decoders $f_{col}$ and $f_{ray}$ for collision checking and ray testing, which in turn serve motion planning and language grounding.
  • Figure 3: Our diffusion-model-based state estimator for oriCORN. The process begins with a random Gaussian initialization of $K$ object states at diffusion timestep $T$, denoted as $\{\hat{\mathbf{s}}_i^T\}_{i=1}^K$. Each $\mathbf{s}$ includes the center position of an object and its geometrically representative points, marked by blue circles and yellow circles respectively in the images. We iteratively refine $\hat{\mathbf{s}}$ by progressing through diffusion timesteps from $T$ to 0, recursively applying the denoising process. First, images are processed using a U-Net architecture (Ronneberger et al., 2015) (yellow boxes) to extract pixel-wise image features. At each denoising step, image features corresponding to the projected locations of $\mathbf{p}$ (illustrated as yellow circles) are extracted from the image plane, depicted as orange boxes. $\{\hat{\mathbf{s}}_i^t\}_{i=1}^K$ and these image features are then jointly processed to refine the object states to $\{\hat{\mathbf{s}}_i^{t-1}\}_{i=1}^K$ for the following timestep $t-1$. Upon completion of $T$ timesteps, the refined $\{\hat{\mathbf{s}}_i^0\}_{i=1}^K$ are produced as the output.
  • Figure 4: Comparison of image feature extraction strategies across estimation methods: (a) PARQ and PETR utilize image features from all pixels and apply an attention mechanism to the extracted features. (b) RayTran uses voxels to extract image features. It processes images with a 2D CNN, defines a 3D voxel grid on the scene, projects each grid cell to the 2D image planes, extracts image features at those locations, and applies an attention mechanism to these features. The voxel grid is visualized in the figure. (c) DETR3D extracts image features from a single pixel. It iteratively refines object location predictions and uses these predictions to extract relevant image features. The red point indicates a predicted object location during an intermediate iteration, which is projected onto the image to obtain the image feature at that 3D point. (d) Rather than using a single point for feature extraction, our method gathers image features from a set of geometrically representative points $\mathbf{p}$, illustrated by three points.
  • Figure 5: Illustration of the pre-processing used to achieve invariance. Cases 1 and 2 have the same relative transform between the two objects, but their global transforms are different. If we treat the two objects as a single composite rigid body, we can assign frames $\{v_1\}$ and $\{v_2\}$ whose origins are at the midpoint between the centers of the two objects, and whose directions are determined by the line through the centers. We then apply $T_{v_1w}$ and $T_{v_2w}$ to these frames so that they are at the origin of the world frame $\{w\}$, with their orientations aligned with that of $\{w\}$, as shown at the bottom. This pre-processing step ensures consistent input irrespective of the objects' global poses.
  • ...and 12 more figures
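
The point-based feature extraction described in the captions of Figures 3 and 4(d) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the pinhole camera model, the world-to-camera convention `x_cam = R @ x_world + t`, nearest-pixel sampling, and plain averaging across views are all assumptions made here for illustration, and the function names `project_points` and `gather_point_features` are hypothetical. The paper's estimator processes the gathered features with a learned network instead of averaging them.

```python
import numpy as np

def project_points(points_w, K, R, t):
    """Project world-frame points (M, 3) to pixel coordinates (M, 2).

    Assumed convention (not from the paper): pinhole intrinsics K (3x3)
    and world-to-camera extrinsics, x_cam = R @ x_world + t.
    """
    pts_c = points_w @ R.T + t                  # points in the camera frame
    depth = pts_c[:, 2]
    safe_z = np.where(depth > 0, depth, 1.0)    # avoid dividing by z <= 0
    uvw = pts_c @ K.T
    uv = uvw[:, :2] / safe_z[:, None]
    return uv, depth

def gather_point_features(feat_maps, Ks, Rs, ts, points_w):
    """Sample per-view feature maps at the projections of the points.

    feat_maps: list of (H, W, C) arrays, one per view.
    Returns an (M, C) array: the mean feature over the views in which each
    point projects inside the image (zeros if it is visible in no view).
    """
    M = points_w.shape[0]
    C = feat_maps[0].shape[-1]
    total = np.zeros((M, C))
    count = np.zeros((M, 1))
    for fmap, K, R, t in zip(feat_maps, Ks, Rs, ts):
        H, W, _ = fmap.shape
        uv, depth = project_points(points_w, K, R, t)
        cols = np.rint(uv[:, 0]).astype(int)    # u indexes image columns
        rows = np.rint(uv[:, 1]).astype(int)    # v indexes image rows
        valid = (depth > 0) & (cols >= 0) & (cols < W) & (rows >= 0) & (rows < H)
        total[valid] += fmap[rows[valid], cols[valid]]
        count[valid] += 1
    return total / np.maximum(count, 1)
```

In a refinement loop of the kind shown in Figure 3, a routine like this would be called once per denoising step with the current estimate of $\mathbf{p}$, so that the sampled features follow the object hypotheses as they move.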
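
The pair-wise pre-processing in the caption of Figure 5 can be sketched as follows, under stated assumptions. The caption specifies only the frame origin (the midpoint between the two object centers) and one direction (the line through the centers); the choice of the two remaining axes below is an arbitrary convention introduced here, and the names `pair_frame` and `canonicalize` are hypothetical. The sketch therefore removes the global translation and the direction of the center line, but makes no claim about how the paper treats the residual rotation about that line.

```python
import numpy as np

def pair_frame(c1, c2):
    """Build a frame for an object pair with distinct centers c1, c2.

    Origin: midpoint of the two centers. x-axis: direction from c1 to c2.
    The y- and z-axes are completed with an arbitrary helper axis
    (an assumption of this sketch, not specified in the caption).
    Returns (R, origin), where the columns of R are the frame axes
    expressed in world coordinates.
    """
    origin = 0.5 * (c1 + c2)
    x = (c2 - c1) / np.linalg.norm(c2 - c1)
    # Pick the world axis least aligned with x so the cross product is well defined.
    helper = np.eye(3)[np.argmin(np.abs(x))]
    y = np.cross(helper, x)
    y = y / np.linalg.norm(y)
    z = np.cross(x, y)                       # right-handed: x cross y = z
    return np.stack([x, y, z], axis=1), origin

def canonicalize(points_w, R, origin):
    """Express world-frame points (N, 3) in the pair frame: R^T (p - origin)."""
    return (points_w - origin) @ R
```

With this construction the two centers always map to $(-d/2, 0, 0)$ and $(d/2, 0, 0)$, where $d$ is the distance between them, regardless of the pair's global pose. This is the sense in which Cases 1 and 2 of Figure 5 yield consistent input.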