Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation

Yunhai Feng, Jiaming Han, Zhuoran Yang, Xiangyu Yue, Sergey Levine, Jianlan Luo

TL;DR

This work tackles the challenge of long-horizon robotic manipulation by endowing vision-language models with a reflection-based planning loop. It introduces ReflectVLM, which combines a diffusion dynamics model to imagine future states with a reflection mechanism that critiques and revises action sequences at test time, all trained via interactive imitation learning. Across a procedurally generated suite of interlocking-piece assembly tasks, ReflectVLM substantially outperforms zero-shot VLMs and MCTS baselines while maintaining far lower compute than search-based methods. The results demonstrate the value of integrating visual imagination and self-reflection into pre-trained VLMs for physically grounded, multi-step robotic control, with broad potential applicability beyond the tested domain.

Abstract

Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities and the ability to reason about the physical world and reactively choose appropriate motor skills. Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems. However, in their current form, VLMs lack both the nuanced understanding of intricate physics required for robotic manipulation and the ability to reason over long horizons to address error compounding issues. In this paper, we introduce a novel test-time computation framework that enhances VLMs' physical reasoning capabilities for multi-stage manipulation tasks. At its core, our approach iteratively improves a pretrained VLM with a "reflection" mechanism: it uses a generative model to imagine future world states, leverages these predictions to guide action selection, and critically reflects on potential suboptimalities to refine its reasoning. Experimental results demonstrate that our method significantly outperforms several state-of-the-art commercial VLMs as well as other post-training approaches such as Monte Carlo Tree Search (MCTS). Videos are available at https://reflect-vlm.github.io.
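
The propose, imagine, reflect loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function `reflective_step`, its callable arguments (`propose`, `imagine`, `reflect`), the `horizon` parameter, and the integer toy environment are all hypothetical stand-ins for the VLM policy, the generative dynamics model, and the reflection prompt.

```python
def reflective_step(obs, goal, propose, imagine, reflect, horizon=3):
    """One decision step of a propose -> imagine -> reflect loop (sketch).

    propose(obs, goal) -> action             stand-in for the VLM policy
    imagine(obs, action) -> predicted obs    stand-in for the dynamics model
    reflect(obs, goal, imagined, action) -> action, kept or revised
    """
    first_action = propose(obs, goal)

    # Roll the plan forward in imagination for `horizon` steps.
    imagined, action = obs, first_action
    for _ in range(horizon):
        imagined = imagine(imagined, action)
        action = propose(imagined, goal)

    # Let the policy reconsider its first action given the imagined future.
    return reflect(obs, goal, imagined, first_action)


# Toy stand-ins (assumptions for illustration only): states are integers,
# actions are +1 / -1, and the naive policy always proposes +1.
def toy_propose(obs, goal):
    return 1


def toy_imagine(obs, action):
    return obs + action


def toy_reflect(obs, goal, imagined, action):
    # Revise the action if the imagined future is no closer to the goal.
    if abs(goal - imagined) >= abs(goal - obs):
        return -action
    return action


if __name__ == "__main__":
    print(reflective_step(5, 2, toy_propose, toy_imagine, toy_reflect))
    print(reflective_step(0, 5, toy_propose, toy_imagine, toy_reflect))
```

In the toy setup the naive proposal is kept when the imagined future moves toward the goal and flipped when it does not, which mirrors in miniature the idea of revising an action whose predicted outcome shows little progress.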

Paper Structure

This paper contains 30 sections, 1 equation, 16 figures, 5 tables, 3 algorithms.

Figures (16)

  • Figure 1: Reflective planning. Our method uses a VLM to propose actions and a diffusion dynamics model to imagine the future state that results from executing the plan. The imagined future helps the VLM reflect on the initial plan and propose a better action.
  • Figure 2: Training data generation. Training data for the reflection mechanism is collected by relabeling the rollouts. For each timestep, two training examples are generated: (Q1, A1) for action proposal and (Q2, A2) for reflection. $H$ is the imagination horizon, and $h$ is the history length. $a_t^*$ is the action label given by the expert policy.
  • Figure 3: Architecture of Diffusion Dynamics Model, which consists of a latent encoder, text encoder, Diffusion UNet and latent decoder. The latent encoder and text encoder are frozen during training, while Diffusion UNet and latent decoder are finetuned on our task data. $\mathcal{N}$: random noise.
  • Figure 4: Filmstrip of our method solving a complicated assembly task. Frames are indexed by timestep. The goal image is in the top-left corner (with a green border). Each frame is the observation after executing the action (in black) above it. The other action in gray is the original action proposed by the VLM if it is revised after reflection. We highlight the reflection process at timestep 15, where the VLM first proposes an action to pick up the purple brick, but after reflection, it chooses to pick up the yellow brick instead as the generated future state (red-bordered image) shows little progress towards the goal.
  • Figure 5: Task examples. (a) Generated multi-stage manipulation tasks with interlocking pieces. Top: initial configurations. Bottom: goal configurations. See App. \ref{sec:app_more_task_samples} for more examples. (b) The graph shows the dependencies between the objects in the blue assembly board on the left. Each node represents an object, and each directed edge indicates that the predecessor object should be assembled before the successor object.
  • ...and 11 more figures
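
The dependency graph described in Figure 5(b) implies that any valid assembly sequence is a topological ordering of the objects. The sketch below illustrates that idea with Python's standard `graphlib` module; the piece names and the `assembly_order` helper are hypothetical and are not taken from the paper.

```python
from graphlib import TopologicalSorter


def assembly_order(predecessors):
    """Return one valid assembly order for a dependency graph.

    `predecessors` maps each object to the set of objects that must be
    assembled before it. Raises graphlib.CycleError if the dependencies
    are circular, i.e. no valid order exists.
    """
    return list(TopologicalSorter(predecessors).static_order())


# Hypothetical board: "red" goes first, "purple" needs both "blue" and "yellow".
deps = {
    "red": set(),
    "blue": {"red"},
    "yellow": {"red"},
    "purple": {"blue", "yellow"},
}

if __name__ == "__main__":
    print(assembly_order(deps))
```

Several orders can be valid for the same graph (here "blue" and "yellow" are interchangeable), so a check should verify that every predecessor precedes its successor rather than compare against one fixed sequence.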