Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models
Ivan Kapelyukh, Yifei Ren, Ignacio Alzugaray, Edward Johns
TL;DR
Dream2Real tackles zero-shot, language-conditioned 3D object rearrangement by combining 2D vision-language models (VLMs) with a 3D scene representation built from object-centric NeRFs. The robot imagines candidate rearrangements, renders them, and scores each render against the user instruction with a CLIP-based evaluator, then selects a physically valid goal pose for execution. The authors also introduce distractor filtering via language models, normalising captions that focus the evaluation on spatial relations, and multi-view aggregation. Together these enable robust 6-DoF rearrangement in real scenes without task-specific training data. The work demonstrates that 2D VLMs can provide strong visual priors for 3D manipulation, supporting zero-shot, language-driven rearrangement in both tabletop and 6-DoF settings.
Abstract
We introduce Dream2Real, a robotics framework which integrates vision-language models (VLMs) trained on 2D data into a 3D object rearrangement pipeline. This is achieved by the robot autonomously constructing a 3D representation of the scene, where objects can be rearranged virtually and an image of the resulting arrangement rendered. These renders are evaluated by a VLM, so that the arrangement which best satisfies the user instruction is selected and recreated in the real world with pick-and-place. This enables language-conditioned rearrangement to be performed zero-shot, without needing to collect a training dataset of example arrangements. Results on a series of real-world tasks show that this framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and readily applicable to both tabletop and 6-DoF rearrangement tasks.
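The imagine, render, evaluate, and select loop described above can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every name here (`select_goal_pose`, `is_valid`, `render_views`, `score_view`) is hypothetical, and the renderer, the scorer, and the validity check are toy stand-ins for the object-centric 3D scene model, the VLM evaluator, and a physics check. Averaging scores across views is likewise an assumption about how multi-view evidence might be combined.

```python
import numpy as np


def select_goal_pose(candidate_poses, is_valid, render_views, score_view):
    """Return the valid candidate pose whose renders score highest.

    All callables are hypothetical stand-ins:
      is_valid(pose)     -> bool, e.g. a collision or support check
      render_views(pose) -> list of rendered "images" for that pose
      score_view(image)  -> float, similarity between image and instruction
    """
    best_pose, best_score = None, -np.inf
    for pose in candidate_poses:
        if not is_valid(pose):  # skip physically invalid arrangements
            continue
        views = render_views(pose)
        # Assumption: aggregate multi-view scores with a simple mean.
        score = float(np.mean([score_view(v) for v in views]))
        if score > best_score:
            best_pose, best_score = pose, score
    return best_pose, best_score


# Toy setup: poses are 3D positions, and a "render" is just the position
# seen with a small fixed per-camera offset.
rng = np.random.default_rng(0)
candidates = rng.uniform(-0.5, 0.5, size=(200, 3))
target = np.array([0.3, 0.1, 0.0])  # stands in for "what the instruction wants"
camera_offsets = [np.zeros(3), np.array([0.01, 0.0, 0.0]), np.array([0.0, 0.01, 0.0])]


def is_valid(pose):
    return pose[2] >= 0.0  # toy rule: object must not sink below the table


def render_views(pose):
    return [pose + offset for offset in camera_offsets]


def score_view(image):
    return -float(np.linalg.norm(image - target))  # toy similarity score


best_pose, best_score = select_goal_pose(candidates, is_valid, render_views, score_view)
print(best_pose, best_score)
```

In a real system the toy callables would be replaced by renders of the rearranged 3D scene and by image-text similarity from the VLM. The selection logic stays the same: invalid poses are filtered out first, and the highest-scoring remaining pose becomes the pick-and-place goal.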
