Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models

Ivan Kapelyukh, Yifei Ren, Ignacio Alzugaray, Edward Johns

TL;DR

Dream2Real tackles zero-shot, language-conditioned 3D object rearrangement by combining 2D vision-language models with a 3D scene representation built from object-centric NeRFs. The robot imagines candidate rearrangements, renders them, and uses a CLIP-based evaluator to score each configuration against the user instruction, selecting a physically valid goal pose for execution. The authors introduce language-model-based distractor filtering, caption normalisation that focuses scoring on spatial relations, and multi-view aggregation, enabling robust 6-DoF rearrangement in real scenes without task-specific training data. The work demonstrates that 2D VLMs can provide strong visual priors for 3D manipulation, achieving zero-shot, language-driven rearrangement in both tabletop and 6-DoF settings.
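As a rough illustration of multi-view aggregation, the sketch below combines per-view scores for each candidate pose into one value. This summary does not say how the views are combined, so the mean used here is an assumption, and the function name, pose labels, and numbers are all hypothetical toy values.

```python
from statistics import mean

def aggregate_views(per_view_scores):
    """Collapse each candidate's per-view scores into one score.

    Assumption: a plain mean over views. The actual aggregation rule
    is not given in this summary.
    """
    return {pose: mean(scores) for pose, scores in per_view_scores.items()}

# Hypothetical toy scores for two candidate poses seen from three views.
per_view = {
    "pose_a": [0.31, 0.29, 0.33],
    "pose_b": [0.35, 0.20, 0.22],  # looks best from one view only
}

aggregated = aggregate_views(per_view)
best_pose = max(aggregated, key=aggregated.get)
print(best_pose, round(aggregated[best_pose], 3))
```

In this toy case pose_b has the single highest per-view score, yet pose_a wins after aggregation because it scores consistently across views.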

Abstract

We introduce Dream2Real, a robotics framework which integrates vision-language models (VLMs) trained on 2D data into a 3D object rearrangement pipeline. This is achieved by the robot autonomously constructing a 3D representation of the scene, where objects can be rearranged virtually and an image of the resulting arrangement rendered. These renders are evaluated by a VLM, so that the arrangement which best satisfies the user instruction is selected and recreated in the real world with pick-and-place. This enables language-conditioned rearrangement to be performed zero-shot, without needing to collect a training dataset of example arrangements. Results on a series of real-world tasks show that this framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and readily applicable to both tabletop and 6-DoF rearrangement tasks.
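The imagine, render, score, and select loop described above can be sketched as a simple search over candidate poses. This is a minimal sketch under stated assumptions, not the authors' implementation: `select_goal_pose`, `is_colliding`, and `score_render` are hypothetical names, and the scorer is a geometric stand-in for rendering an arrangement and scoring it with a VLM.

```python
import math

def select_goal_pose(candidates, is_colliding, score_render):
    """Return the highest-scoring collision-free pose and all valid scores."""
    best_pose, best_score = None, -math.inf
    scores = {}
    for pose in candidates:
        if is_colliding(pose):
            continue  # physically invalid poses are never scored
        score = score_render(pose)
        scores[pose] = score
        if score > best_score:
            best_pose, best_score = pose, score
    if best_pose is None:
        raise ValueError("no collision-free candidate pose")
    return best_pose, scores

# Toy 2D scene (all values hypothetical): a grid of candidate positions,
# one obstacle, and a target location standing in for the instruction.
candidates = [(i / 10, j / 10) for i in range(11) for j in range(11)]
TARGET = (0.7, 0.3)
OBSTACLE = (0.2, 0.8)

def is_colliding(pose):
    # Stand-in collision check: a disc of radius 0.15 around the obstacle.
    return math.dist(pose, OBSTACLE) < 0.15

def score_render(pose):
    # Stand-in for "render the imagined scene, then score it with a VLM":
    # the score is simply higher the closer the pose is to the target.
    return -math.dist(pose, TARGET)

goal, scores = select_goal_pose(candidates, is_colliding, score_render)
print(goal)
```

In a real system the stand-in scorer would be replaced by a renderer and an image-text similarity model, and the selected pose would be passed to pick-and-place.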

Paper Structure

This paper contains 14 sections, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: The Dream2Real pipeline. The robot first autonomously builds a model of the scene. Then the user instruction is used to determine which object should be moved, and so the robot can imagine new configurations of the scene and score them using a VLM. Finally, the highest-scoring pose is used as the goal for pick-and-place to complete the rearrangement.
  • Figure 2: The shopping, pool ball, and shelf scenes.
  • Figure 3: Qualitative results from the shopping scene for the tasks "apple in bowl" (top row) and "apple beside bowl" (bottom row). Figure 2 shows the full shopping scene. In the heatmaps (overlaid on the TSDF of the scene), yellow indicates high-scoring positions of the apple, whereas dark blue indicates low-scoring regions, and colliding poses are not included. The red dot highlights the highest-scoring position. The highest-scoring render is shown on the right.
  • Figure 4: Qualitative results from the pool ball scene for the tasks "in triangle" (top row) and "in X shape" (bottom row). The red dot is used to highlight the high-scoring area.
  • Figure 5: Results for the three tasks on the shelf scene, with heatmaps (top row) and the highest-scoring renders (bottom).
  • ...and 2 more figures