GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation

Hongyin Zhang, Pengxiang Ding, Shangke Lyu, Ying Peng, Donglin Wang

TL;DR

This work addresses the fragility of vision-language-action models when deployed under external perturbations by introducing GEVRM, a robust VLA framework grounded in internal model control. It combines a text-guided video diffusion planner to generate expressive future goals, prototypical-contrastive state alignment to simulate perturbations, and a goal-guided diffusion policy to produce robust actions. The approach yields state-of-the-art results on the CALVIN benchmark under both standard and perturbed conditions and demonstrates improved real-world task robustness. Overall, GEVRM advances reliable, perturbation-resilient robotic decision-making by integrating expressive goal generation with closed-loop disturbance handling.

Abstract

With the rapid development of embodied artificial intelligence, significant progress has been made in vision-language-action (VLA) models for general robot decision-making. However, the majority of existing VLAs fail to account for the inevitable external perturbations encountered during deployment. These perturbations introduce unforeseen state information to the VLA, resulting in inaccurate actions and, consequently, a significant decline in generalization performance. The classic internal model control (IMC) principle demonstrates that a closed-loop system with an internal model that includes external input signals can accurately track the reference input and effectively offset the disturbance. We propose GEVRM, a novel closed-loop VLA method that integrates the IMC principle to enhance the robustness of robot visual manipulation. The text-guided video generation model in GEVRM can generate highly expressive future visual planning goals. Simultaneously, we evaluate perturbations by simulating responses, which are called internal embeddings and are optimized through prototype contrastive learning. This allows the model to implicitly infer and distinguish perturbations from the external environment. The proposed GEVRM achieves state-of-the-art performance on both standard and perturbed CALVIN benchmarks and shows significant improvements in realistic robot tasks.

Paper Structure

This paper contains 16 sections, 7 equations, 15 figures, 10 tables, 1 algorithm.

Figures (15)

  • Figure 1: We are inspired by the classical internal model control (a) in automation systems. The principle illustrates that a closed-loop system equipped with an internal model that accounts for external input signals can precisely follow the reference input and effectively neutralize the perturbations. Motivated by this principle, we design an internal model visuomotor control framework (b). We leverage a text-guided video model for generating highly expressive visual goal states as reference input, goal-state and current-state internal encoders for modeling responses, and a goal-guided policy for robust action generation.
  • Figure 2: The proposed GEVRM model. First, the T5 model is utilized to encode language instructions, and 2D and 3D VAEs are utilized to compress and restore the original pixel space of the robot image state sequence, followed by the DiT module and random mask mechanism to generate the goal image state. Then, through prototypical contrastive learning, the current and goal states are aligned to simulate responses and evaluate perturbations. Finally, the goal-guided policy predicts the 7-dimensional robot decision action.
  • Figure 3: Comparison of goal generation on task “put blueberry in pot or pan on stove”.
  • Figure 4: The model is trained only on data collected in environments A, B, and C (a), and tested on environment D (b). In addition, we apply five perturbations to the image observations of environment D to further test the generalization of the model in more challenging scenarios (c).
  • Figure 5: Ablation study on the CALVIN ABC → D. (a) We compare different training paradigms. (b) We examine the impact of different values of the state alignment (SA) hyperparameter $\lambda$.
  • ...and 10 more figures