Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation

Jiaming Chen, Wentao Zhao, Ziyu Meng, Donghui Mao, Ran Song, Wei Pan, Wei Zhang

TL;DR

This work addresses the perception gap in Model Predictive Control for robotic manipulation by integrating Vision-Language Models into the MPC loop, enabling open-domain scene understanding and planning from goal images or language. It introduces VLMPC, featuring conditional action sampling guided by VLMs, an action-conditioned video predictor, and a hierarchical cost that blends pixel-level and knowledge-level cues; an enhanced Traj-VLMPC replaces per-step sampling with a 3D GMM trajectory sampler and a voxel-based value map for efficient long-horizon planning. The proposed methods outperform state-of-the-art baselines in both simulation (RoboDesk and Language Table benchmarks) and real-world UR5 experiments, with Traj-VLMPC offering improved stability and speed on long-horizon tasks. The work demonstrates the practical potential of combining foundation models with MPC for flexible, perception-driven robotic manipulation, while acknowledging computational costs and suggesting future work on more powerful world models and integration schemes.

Abstract

Model Predictive Control (MPC) is a widely adopted control paradigm that leverages predictive models to estimate future system states and optimize control inputs accordingly. However, while MPC excels in planning and control, it lacks the capability for environmental perception, leading to failures in complex and unstructured scenarios. To address this limitation, we introduce Vision-Language Model Predictive Control (VLMPC), a robotic manipulation planning framework that integrates the perception power of vision-language models (VLMs) with MPC. VLMPC uses a conditional action sampling module that takes a goal image or language instruction as input and leverages a VLM to generate candidate action sequences. These candidates are fed into a video prediction model that simulates future frames based on the actions. In addition, we propose an enhanced variant, Traj-VLMPC, which replaces video prediction with motion trajectory generation to reduce computational complexity while maintaining accuracy. Traj-VLMPC estimates motion dynamics conditioned on the candidate actions, offering a more efficient alternative for long-horizon tasks and real-time applications. Both VLMPC and Traj-VLMPC select the optimal action sequence using a VLM-based hierarchical cost function that captures both pixel-level and knowledge-level consistency between the current observation and the task input. We demonstrate that both approaches outperform existing state-of-the-art methods on public benchmarks and achieve excellent performance in various real-world robotic manipulation tasks. Code is available at https://github.com/PPjmchen/VLMPC.
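
The sample, predict, score, and execute loop described in the abstract can be sketched in a few lines. The sketch below is illustrative only and is not the authors' implementation. It assumes a 2D point in place of the robot, a direction vector toward the goal in place of the VLM-generated sampling guidance, a displacement rollout in place of the video prediction model, and a weighted sum of a distance term and a hand-written penalty in place of the hierarchical cost. All function names, weights, and parameters are hypothetical.

```python
import math
import random

def sample_sequences(direction, n, horizon, sigma, rng):
    """Draw n action sequences around a coarse direction (stand-in for VLM guidance)."""
    return [
        [(rng.gauss(direction[0], sigma), rng.gauss(direction[1], sigma))
         for _ in range(horizon)]
        for _ in range(n)
    ]

def rollout(state, actions):
    """Toy predictive model (assumption): each action displaces a 2D point."""
    x, y = state
    states = []
    for ax, ay in actions:
        x, y = x + ax, y + ay
        states.append((x, y))
    return states

def hierarchical_cost(states, goal, obstacle, w_low=1.0, w_high=5.0):
    """Weighted sum of a low-level distance term and a high-level penalty term."""
    low = math.dist(states[-1], goal)                      # distance of final state to goal
    high = sum(1.0 for s in states if math.dist(s, obstacle) < 0.15)  # states near obstacle
    return w_low * low + w_high * high

def mpc_step(state, goal, obstacle, rng, n=64, horizon=5, step=0.1, sigma=0.05):
    """One MPC iteration: sample, predict, score, and return the first action."""
    dx, dy = goal[0] - state[0], goal[1] - state[1]
    norm = math.hypot(dx, dy) or 1.0
    step_len = min(step, norm / horizon)                   # shrink steps close to the goal
    direction = (step_len * dx / norm, step_len * dy / norm)
    candidates = sample_sequences(direction, n, horizon, sigma, rng)
    best = min(candidates,
               key=lambda seq: hierarchical_cost(rollout(state, seq), goal, obstacle))
    return best[0]

def run(start, goal, obstacle, steps=40, seed=0):
    """Closed loop: replan at every step and execute only the first action."""
    rng = random.Random(seed)
    state = start
    path = [state]
    for _ in range(steps):
        a = mpc_step(state, goal, obstacle, rng)
        state = (state[0] + a[0], state[1] + a[1])
        path.append(state)
        if math.dist(state, goal) < 0.05:
            break
    return path
```

Calling `run((0.0, 0.0), (1.0, 0.0), (0.5, 1.0))` returns the list of visited states. Only the first action of the best sequence is executed before replanning, which is the receding-horizon property that lets the controller react to new observations.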

Paper Structure

This paper contains 27 sections, 27 equations, 11 figures, 4 tables, 1 algorithm.

Figures (11)

  • Figure 1: VLMPC takes as input either a goal image or a language instruction. It first prompts VLMs to generate a conditional sampling distribution, from which action sequences are sampled. These action sequences are then fed into a lightweight action-conditioned video prediction model to predict a set of future frames. The predicted frames are assessed with a hierarchical cost function composed of two sub-costs: a pixel distance cost and a VLM-assisted cost. VLMPC finally selects the best action sequence; the robot executes its first action, and the subsequent actions are fed back into the action sampling module to further assist conditional action sampling.
  • Figure 2: Given a specifically designed prompt $\phi_s$, the VLMs take as input the current observation $O_t$ and a goal image or a language instruction, and generate a coarse-level moving direction for the end-effector.
  • Figure 3: Given the past two frames $O_{t-1}$ and $O_{t}$, the corresponding executed actions $a_{t-1}$ and $a_{t}$, and the action $a^{n}_{t+1}$, DMVFN-Act predicts the next frame $\widehat{O}^{n}_{t+1}(a^{n}_{t+1})$. The dashed boxes and arrows indicate the iterative process of taking the actions one by one and predicting the future states frame by frame.
  • Figure 4: Illustration of the end-effector, the next sub-goal, and the interference objects in the current observation. Red, green, and yellow boxes denote the interference objects, the end-effector, and the next sub-goal generated by VLMPC, respectively.
  • Figure 5: Workflow of Traj-VLMPC. Given the end-effector position $P_\text{init}(t_k)$ and the sub-goal $P_\text{end}(t_k)$, a GMM $p(t_k)$ is constructed in 3D space with $M$ kernels. Candidate trajectories are sampled from the GMM and evaluated via the voxel-based 3D value map, with the lowest-cost path executed at each time step in an MPC loop.
  • ...and 6 more figures
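
The Traj-VLMPC workflow in the Figure 5 caption (build a GMM between the end-effector position and the sub-goal, sample candidate trajectories, score them on a voxel value map, keep the lowest-cost one) can be illustrated with a small sketch. This is not the authors' construction. It assumes isotropic kernels spaced evenly on the straight line from $P_\text{init}$ to $P_\text{end}$, draws one waypoint from each kernel in order rather than sampling the mixture by weight, and represents the value map as a sparse dictionary of voxel costs. All names and parameter values are hypothetical.

```python
import random

def build_gmm(p_init, p_end, m, sigma):
    """Place m isotropic Gaussian kernels evenly between p_init and p_end (assumed layout)."""
    kernels = []
    for k in range(1, m + 1):
        t = k / (m + 1)
        mean = tuple(a + t * (b - a) for a, b in zip(p_init, p_end))
        kernels.append((mean, sigma))
    return kernels

def sample_trajectory(p_init, p_end, kernels, rng):
    """Draw one waypoint per kernel, keeping the start and end points fixed."""
    waypoints = [tuple(rng.gauss(mu, sigma) for mu in mean) for mean, sigma in kernels]
    return [tuple(p_init)] + waypoints + [tuple(p_end)]

def voxel_index(point, voxel_size):
    """Map a 3D point to the integer index of the voxel that contains it."""
    return tuple(int(c // voxel_size) for c in point)

def trajectory_cost(traj, value_map, voxel_size, default=0.0):
    """Sum the value-map cost of every voxel visited by the trajectory's waypoints."""
    return sum(value_map.get(voxel_index(p, voxel_size), default) for p in traj)

def plan(p_init, p_end, value_map, m=3, sigma=0.05, n=100, voxel_size=0.1, seed=0):
    """Sample n candidate trajectories and return the one with the lowest cost."""
    rng = random.Random(seed)
    kernels = build_gmm(p_init, p_end, m, sigma)
    candidates = [sample_trajectory(p_init, p_end, kernels, rng) for _ in range(n)]
    return min(candidates, key=lambda tr: trajectory_cost(tr, value_map, voxel_size))
```

For example, `plan((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), {(5, 0, 0): 10.0})` penalizes waypoints that fall in voxel `(5, 0, 0)`, so candidates that route around that voxel are preferred. In an MPC loop, `plan` would be called again at each time step from the updated end-effector position.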