VLMPC: Vision-Language Model Predictive Control for Robotic Manipulation

Wentao Zhao; Jiaming Chen; Ziyu Meng; Donghui Mao; Ran Song; Wei Zhang

TL;DR

VLMPC tackles the perception gap in model predictive control for robotic manipulation by integrating vision-language models into both action sampling and evaluation. It samples candidate action sequences conditioned on goal images or language, predicts their futures with a lightweight action-conditioned video predictor, and selects the best sequence using a hierarchical cost that combines pixel-level alignment with knowledge-level reasoning. The approach shows state-of-the-art performance on simulated benchmarks and demonstrates robust transfer to real-world tasks, including generalization to unseen objects and scenes. Limitations include potential video-prediction mismatches and increased computational cost, motivating future work on more reliable predictors and efficient VLM-MPC integration.

Abstract

Although Model Predictive Control (MPC) can effectively predict the future states of a system and is thus widely used in robotic manipulation tasks, it lacks the capability of environmental perception, which leads to failure in some complex scenarios. To address this issue, we introduce Vision-Language Model Predictive Control (VLMPC), a robotic manipulation framework that takes advantage of the powerful perception capability of vision-language models (VLMs) and integrates it with MPC. Specifically, we propose a conditional action sampling module which takes as input a goal image or a language instruction and leverages a VLM to sample a set of candidate action sequences. Then, a lightweight action-conditioned video prediction model is designed to generate a set of future frames conditioned on the candidate action sequences. VLMPC produces the optimal action sequence with the assistance of the VLM through a hierarchical cost function that formulates both pixel-level and knowledge-level consistency between the current observation and the goal image. We demonstrate that VLMPC outperforms state-of-the-art methods on public benchmarks. More importantly, our method showcases excellent performance in various real-world robotic manipulation tasks. Code is available at https://github.com/PPjmchen/VLMPC.
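
The pipeline in the abstract follows the general sampling-based MPC pattern: sample candidate action sequences, predict the outcome of each, score the predictions, and execute the first action of the best sequence. The sketch below illustrates only that pattern and is not the authors' implementation. Every name in it (`sample_action_sequences`, `predict_future`, `cost`, `mpc_step`) is hypothetical, the VLM-conditioned sampler is replaced by a Gaussian around a coarse direction, the video predictor by a toy position model, and the hierarchical cost by a plain distance to the goal.

```python
import numpy as np

def sample_action_sequences(rng, mean_direction, num_candidates, horizon, std=0.05):
    """Draw candidate sequences around a coarse direction.
    Assumption: a Gaussian stands in for the VLM-conditioned sampling distribution."""
    shape = (num_candidates, horizon, mean_direction.shape[0])
    mean = np.broadcast_to(mean_direction, shape)
    return mean + std * rng.standard_normal(shape)

def predict_future(state, actions):
    """Toy predictor: the state is a position and each action is a displacement.
    Assumption: this stands in for the action-conditioned video prediction model."""
    return state + np.cumsum(actions, axis=0)  # shape (horizon, dim)

def cost(predicted_states, goal):
    """Toy cost: distance between the last predicted state and the goal.
    Assumption: this stands in for the hierarchical cost function."""
    return float(np.linalg.norm(predicted_states[-1] - goal))

def mpc_step(rng, state, goal, num_candidates=64, horizon=5):
    """One MPC iteration: sample, predict, score, select."""
    direction = goal - state
    norm = np.linalg.norm(direction)
    mean_direction = 0.1 * direction / norm if norm > 0 else np.zeros_like(direction)
    candidates = sample_action_sequences(rng, mean_direction, num_candidates, horizon)
    costs = np.array([cost(predict_future(state, seq), goal) for seq in candidates])
    best = candidates[int(np.argmin(costs))]
    # Execute best[0]; best[1:] can inform the next round of sampling.
    return best[0], best[1:], float(costs.min())
```

Calling `mpc_step` in a loop and adding the returned first action to `state` gives the usual receding-horizon behaviour: only one action is committed per iteration, and the plan is re-sampled from the new observation.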

Paper Structure (21 sections, 15 equations, 8 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Model Predictive Control for Robotic Manipulation
Foundation Models for Robotic Manipulation
Method
Conditional Action Sampling
Action-Conditioned Video Prediction
Hierarchical Cost Function
Pixel Distance Cost
VLM-Assisted Cost
VLM Switcher
Experiments
Implementation Details
Simulation Experiments
Simulation Environments and Experiment Settings
...and 6 more sections

Figures (8)

Figure 1: VLMPC takes as input either a goal image or a language instruction. It first prompts VLMs to generate a conditional sampling distribution, from which action sequences are derived. These action sequences are then fed into a lightweight action-conditioned video prediction model to predict a set of future frames. VLMPC assesses the predicted frames with a hierarchical cost function composed of two sub-costs: a pixel distance cost and a VLM-assisted cost. Finally, VLMPC selects the best action sequence; the robot executes its first action, and the subsequent actions are fed back into the action sampling module to further assist conditional action sampling.
Figure 2: Given a specifically designed prompt $\phi_s$, the VLMs take as input the current observation $O_t$ and a goal image or a language instruction, and generate an end-effector moving direction at a coarse level.
Figure 3: Given the past two frames $O_{t-1}$ and $O_{t}$, the corresponding executed actions $a_{t-1}$ and $a_{t}$, and the action $a^{n}_{t+1}$, DMVFN-Act predicts the next frame $\widehat{O}^{n}_{t+1}(a^{n}_{t+1})$. The dashed boxes and arrows indicate the iterative process of taking the actions one by one and predicting the future states frame by frame.
Figure 4: Illustration of the end-effector, the next sub-goal and the interference objects in the current observation. Red, green, and yellow boxes denote the interference objects, the end-effector and the next sub-goal generated by VLMPC, respectively.
Figure 5: Quantitative comparison with the $\text{VP}^2$ baseline in the RoboDesk environment.
...and 3 more figures
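
The frame-by-frame process in the Figure 3 caption can be read as an autoregressive rollout: each predicted frame becomes an input for the next prediction. The sketch below is a minimal illustration of that loop under stated assumptions, not DMVFN-Act itself. The function names `toy_next_frame` and `rollout` are hypothetical, and the learned predictor is replaced by a toy that shifts the image by an integer `(dy, dx)` action.

```python
import numpy as np

def toy_next_frame(prev_frame, curr_frame, prev_action, curr_action, next_action):
    """Hypothetical stand-in for a learned predictor.
    It ignores the history and shifts the current frame by the integer (dy, dx) in next_action."""
    dy, dx = int(next_action[0]), int(next_action[1])
    return np.roll(curr_frame, shift=(dy, dx), axis=(0, 1))

def rollout(predict_fn, frame_prev, frame_curr, action_prev, action_curr, candidate_actions):
    """Predict one future frame per candidate action, feeding each prediction back in."""
    frames = []
    for next_action in candidate_actions:
        next_frame = predict_fn(frame_prev, frame_curr, action_prev, action_curr, next_action)
        frames.append(next_frame)
        # Slide the two-frame, two-action window forward by one step.
        frame_prev, frame_curr = frame_curr, next_frame
        action_prev, action_curr = action_curr, next_action
    return np.stack(frames)  # shape (len(candidate_actions), H, W)
```

Because every step consumes an earlier prediction, errors in early frames carry into later ones, which is one reason short horizons are common in this kind of rollout.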
