Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation
Yunhai Feng, Jiaming Han, Zhuoran Yang, Xiangyu Yue, Sergey Levine, Jianlan Luo
TL;DR
This work tackles the challenge of long-horizon robotic manipulation by endowing vision-language models with a reflection-based planning loop. It introduces ReflectVLM, which combines a diffusion dynamics model that imagines future states with a reflection mechanism that critiques and revises action sequences at test time, all trained via interactive imitation learning. Across a procedurally generated suite of interlocking-piece assembly tasks, ReflectVLM substantially outperforms zero-shot VLMs and MCTS baselines while using far less compute than search-based methods. The results demonstrate the value of integrating visual imagination and self-reflection into pre-trained VLMs for physically grounded, multi-step robotic control, with potential applicability beyond the tested domain.
Abstract
Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities, the ability to reason about the physical world, and the ability to reactively choose appropriate motor skills. Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems. However, in their current form, VLMs lack both the nuanced understanding of intricate physics required for robotic manipulation and the ability to reason over long horizons to address error compounding issues. In this paper, we introduce a novel test-time computation framework that enhances VLMs' physical reasoning capabilities for multi-stage manipulation tasks. At its core, our approach iteratively improves a pretrained VLM with a "reflection" mechanism: it uses a generative model to imagine future world states, leverages these predictions to guide action selection, and critically reflects on potential suboptimalities to refine its reasoning. Experimental results demonstrate that our method significantly outperforms several state-of-the-art commercial VLMs as well as other post-training approaches such as Monte Carlo Tree Search (MCTS). Videos are available at https://reflect-vlm.github.io.
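The propose, imagine, reflect loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`reflective_step`, `run_episode`, `greedy_propose`, `toy_imagine`, `toy_reflect`) are hypothetical, states and actions are plain integers rather than images and motor skills, and the toy environment is assumed to behave exactly like the dynamics model. In the actual method, the proposal and the reflection would come from a VLM and the imagined state from a generative model.

```python
from typing import Callable, List, Tuple

# All names here are hypothetical placeholders, not the authors' API.
Propose = Callable[[int], int]            # state -> proposed action
Imagine = Callable[[int, int], int]       # (state, action) -> predicted next state
Reflect = Callable[[int, int, int], int]  # (state, imagined future, proposal) -> final action


def reflective_step(state: int, propose: Propose, imagine: Imagine,
                    reflect: Reflect, horizon: int = 1) -> int:
    """Choose one action: propose, imagine `horizon` steps ahead, then reflect."""
    proposed = propose(state)
    # Roll the proposal policy forward inside the dynamics model.
    future = imagine(state, proposed)
    for _ in range(horizon - 1):
        future = imagine(future, propose(future))
    # Reflection sees the current state, the imagined future, and the
    # original proposal, and may keep or revise that proposal.
    return reflect(state, future, proposed)


def run_episode(state: int, goal: int, propose: Propose, imagine: Imagine,
                reflect: Reflect, max_steps: int = 10) -> Tuple[int, List[int]]:
    """Replan with reflection at every step until the goal or the step limit."""
    trace: List[int] = []
    for _ in range(max_steps):
        if state == goal:
            break
        action = reflective_step(state, propose, imagine, reflect)
        # Toy assumption: the real environment matches the dynamics model.
        state = imagine(state, action)
        trace.append(action)
    return state, trace


GOAL = 4


def greedy_propose(state: int) -> int:
    # A deliberately flawed proposer: always takes the largest step.
    return 2


def toy_imagine(state: int, action: int) -> int:
    return state + action


def toy_reflect(state: int, future: int, proposed: int) -> int:
    # Revise to a smaller step if the imagined future overshoots the goal.
    return 1 if future > GOAL else proposed


final_state, actions = run_episode(1, GOAL, greedy_propose, toy_imagine, toy_reflect)
print(final_state, actions)  # 4 [2, 1]
```

In this toy run the proposer's second action would overshoot the goal; the imagined future exposes this, and the reflection step revises the action before it is executed. Unlike tree search, each decision uses a single imagined rollout instead of expanding many branches.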
