Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation
Jiaming Chen, Wentao Zhao, Ziyu Meng, Donghui Mao, Ran Song, Wei Pan, Wei Zhang
TL;DR
This work addresses the perception gap in Model Predictive Control (MPC) for robotic manipulation by integrating Vision-Language Models (VLMs) into the MPC loop, enabling open-domain scene understanding and planning from goal images or language instructions. It introduces VLMPC, which combines VLM-guided conditional action sampling, an action-conditioned video predictor, and a hierarchical cost that blends pixel-level and knowledge-level cues. An enhanced variant, Traj-VLMPC, replaces per-step sampling with a 3D GMM trajectory sampler and a voxel-based value map for efficient long-horizon planning. The proposed methods outperform state-of-the-art baselines both in simulation (the RoboDesk and Language Table benchmarks) and in real-world UR5 experiments, with Traj-VLMPC offering improved stability and speed on long-horizon tasks. The work demonstrates the practical potential of combining foundation models with MPC for flexible, perception-driven robotic manipulation, while acknowledging computational costs and suggesting future work on more powerful world models and integration schemes.
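As a rough illustration of the Traj-VLMPC idea summarized above, the sketch below samples 3D end points from a Gaussian mixture, turns each into a trajectory, and scores it against a voxel value map. It is a minimal sketch under stated assumptions, not the authors' implementation: the isotropic mixture, the straight-line trajectories, the summed-value score, and every function and parameter name (`sample_gmm`, `straight_line`, `trajectory_value`, `voxel_size`) are hypothetical choices made for illustration.

```python
import numpy as np

def sample_gmm(rng, weights, means, stds, n):
    """Draw n points from a 3D isotropic Gaussian mixture (hypothetical sampler)."""
    comp = rng.choice(len(weights), size=n, p=weights)   # pick a component per sample
    noise = rng.standard_normal((n, 3))
    return means[comp] + stds[comp][:, None] * noise

def straight_line(start, end, steps):
    """Assumption: a trajectory is a straight line from start to a sampled end point."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * start + t * end

def trajectory_value(traj, value_map, voxel_size):
    """Sum the voxel values visited by the trajectory (higher is better)."""
    upper = np.array(value_map.shape) - 1
    idx = np.clip((traj / voxel_size).astype(int), 0, upper)  # clamp to the grid
    return value_map[idx[:, 0], idx[:, 1], idx[:, 2]].sum()

def best_trajectory(rng, start, weights, means, stds, value_map,
                    voxel_size=0.1, n=32, steps=10):
    """Sample n candidate trajectories and keep the highest-valued one."""
    ends = sample_gmm(rng, weights, means, stds, n)
    trajs = [straight_line(start, e, steps) for e in ends]
    values = np.array([trajectory_value(t, value_map, voxel_size) for t in trajs])
    best = int(np.argmax(values))
    return trajs[best], values
```

In this toy setup the value map would be filled in advance, for example with higher values near the target region; how the actual method constructs its value map from VLM output is not specified here.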
Abstract
Model Predictive Control (MPC) is a widely adopted control paradigm that leverages predictive models to estimate future system states and optimize control inputs accordingly. However, while MPC excels in planning and control, it lacks the capability for environmental perception, leading to failures in complex and unstructured scenarios. To address this limitation, we introduce Vision-Language Model Predictive Control (VLMPC), a robotic manipulation planning framework that integrates the perception capabilities of vision-language models (VLMs) with MPC. VLMPC uses a conditional action sampling module that takes a goal image or a language instruction as input and leverages a VLM to generate candidate action sequences. These candidates are fed into a video prediction model that simulates future frames conditioned on the actions. In addition, we propose an enhanced variant, Traj-VLMPC, which replaces video prediction with motion trajectory generation to reduce computational complexity while maintaining accuracy. Traj-VLMPC estimates motion dynamics conditioned on the candidate actions, offering a more efficient alternative for long-horizon tasks and real-time applications. Both VLMPC and Traj-VLMPC select the optimal action sequence using a VLM-based hierarchical cost function that captures both pixel-level and knowledge-level consistency between the current observation and the task input. We demonstrate that both approaches outperform existing state-of-the-art methods on public benchmarks and perform well on a range of real-world robotic manipulation tasks. Code is available at https://github.com/PPjmchen/VLMPC.
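The sample, predict, score, and select loop described in the abstract can be sketched as a generic sampling-based MPC step. This is a toy illustration under stated assumptions, not the released VLMPC code: Gaussian sampling around a goal-directed mean stands in for VLM-guided action sampling, a cumulative-sum point model stands in for the video predictor, and the two cost terms (distance to goal, penalty for entering a forbidden region) are hypothetical placeholders for the pixel-level and knowledge-level costs. All names are invented for this example.

```python
import numpy as np

def sample_actions(rng, mean, std, n, horizon):
    """Stand-in for VLM-guided sampling: Gaussian noise around a mean action."""
    return rng.normal(mean, std, size=(n, horizon, mean.shape[0]))

def rollout(state, actions):
    """Stand-in for the predictive model: each action is a displacement."""
    return state + np.cumsum(actions, axis=0)

def hierarchical_cost(traj, goal, avoid, w_pixel=1.0, w_knowledge=1.0):
    """Two-level cost: a dense term plus a coarse, rule-like term (both hypothetical)."""
    pixel = np.linalg.norm(traj[-1] - goal)                      # dense goal mismatch
    hits = np.linalg.norm(traj - avoid, axis=1) < 0.1            # enters forbidden region?
    knowledge = float(np.any(hits))
    return w_pixel * pixel + w_knowledge * knowledge

def mpc_step(rng, state, goal, avoid, n=64, horizon=5, std=0.05):
    """One receding-horizon step: sample, predict, score, return the first action."""
    mean = (goal - state) / horizon
    candidates = sample_actions(rng, mean, std, n, horizon)
    costs = np.array([hierarchical_cost(rollout(state, a), goal, avoid)
                      for a in candidates])
    best = candidates[int(np.argmin(costs))]
    return best[0]                                               # execute only the first action
```

Calling `mpc_step` repeatedly and adding the returned action to the state gives the usual receding-horizon behavior: only the first action of the best sequence is executed before replanning from the new state.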
