DEF-oriCORN: efficient 3D scene understanding for robust language-directed manipulation without demonstrations
Dongwon Son, Sanghyeon Son, Jaehyung Kim, Beomjoon Kim
TL;DR
DEF-oriCORN presents an object-centered representation $\mathbf{s}=(\mathbf{z},\mathbf{c},\mathbf{p})$ with a diffusion-based estimator to enable robust language-directed manipulation from sparse RGB views without demonstrations. It couples an SO(3)-equivariant encoder (FER-VN) with decoders for occupancy, collision, and ray-hitting to form oriCORN, which supports efficient collision checking and language grounding via CLIP. The estimator $f_{den}$ aggregates multi-view features and iteratively refines object states across diffusion timesteps, capturing multimodal uncertainty. Planning under this uncertainty combines RRT*-based motion planning with antipodal-grasp heuristics. Extensive simulations show higher estimation accuracy (AP, CD) and faster planning than baselines, while real-world experiments with transparent and shiny objects demonstrate zero-shot transfer, efficient language grounding, and a 75% success rate on language-guided tasks. The framework is designed for RGB-only inputs with zero demonstrations and is released with public data, code, and weights to support reproducibility and broader adoption.
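The antipodal-grasp heuristic mentioned above can be illustrated with the textbook two-contact test: a grasp is antipodal when the line joining the two contact points lies inside both friction cones. The following is a minimal sketch of that generic test, not the authors' implementation; the function name `is_antipodal`, the outward-normal convention, and the default friction coefficient are assumptions made for illustration.

```python
import math
from typing import Sequence

Vec3 = Sequence[float]


def _sub(a: Vec3, b: Vec3) -> tuple:
    return tuple(x - y for x, y in zip(a, b))


def _dot(a: Vec3, b: Vec3) -> float:
    return sum(x * y for x, y in zip(a, b))


def _unit(a: Vec3) -> tuple:
    n = math.sqrt(_dot(a, a))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return tuple(x / n for x in a)


def is_antipodal(p1: Vec3, n1: Vec3, p2: Vec3, n2: Vec3,
                 friction_coeff: float = 0.5) -> bool:
    """Generic two-contact antipodal test (hypothetical helper).

    p1, p2: contact points on the object surface.
    n1, n2: OUTWARD surface normals at those points (assumed convention).
    friction_coeff: Coulomb friction coefficient mu (assumed value).
    """
    if tuple(p1) == tuple(p2):
        return False  # coincident contacts cannot oppose each other
    u = _unit(_sub(p2, p1))          # unit vector from contact 1 to contact 2
    n1, n2 = _unit(n1), _unit(n2)
    # Friction cone half-angle is atan(mu); compare cosines to avoid acos.
    cos_half_angle = math.cos(math.atan(friction_coeff))
    # Finger 1 pushes along +u, i.e. against the outward normal n1.
    in_cone_1 = -_dot(n1, u) >= cos_half_angle
    # Finger 2 pushes along -u, i.e. against the outward normal n2.
    in_cone_2 = _dot(n2, u) >= cos_half_angle
    return in_cone_1 and in_cone_2


# Opposite faces of a box: the contact line is aligned with both normals.
opposite_faces = is_antipodal((-1, 0, 0), (-1, 0, 0), (1, 0, 0), (1, 0, 0))
# Adjacent faces: the contact line is 45 degrees off each normal.
adjacent_faces = is_antipodal((-1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, 1, 0))
```

Under state uncertainty, a check of this kind would typically be evaluated against several sampled object states rather than a single estimate; how DEF-oriCORN does this is not specified in this summary.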
Abstract
We present DEF-oriCORN, a framework for language-directed manipulation tasks. By leveraging a novel object-based scene representation and a diffusion-model-based state estimation algorithm, our framework enables efficient and robust manipulation planning in response to verbal commands, even in tightly packed environments with sparse camera views and without any demonstrations. Unlike traditional representations, our representation affords efficient collision checking and language grounding. Compared to state-of-the-art baselines, our framework achieves superior estimation and motion planning performance from sparse RGB images and generalizes zero-shot to real-world scenarios with diverse materials, including transparent and reflective objects, despite being trained exclusively in simulation. Our code for data generation, training, and inference, along with pre-trained weights, is publicly available at: https://sites.google.com/view/def-oricorn/home.
