VIP: Vision Instructed Pre-training for Robotic Manipulation

Zhuoling Li, Liangliang Ren, Jinrong Yang, Yong Zhao, Xiaoyang Wu, Zhenhua Xu, Xiang Bai, Hengshuang Zhao

TL;DR

This work tackles the limitation that text-based task prompts struggle to guide robust robotic manipulation when data diversity is limited. It introduces Vision Instructed Pre-training (VIP), which uses vision-based targets comprising the current observation, a cropped future target region, and sparse point flows to supervise action prediction, with flows progressively masked to align training with inference. A Transformer-based policy, VIRT, initialized from DINOv2 and trained on large-scale data, learns to predict action sequences for diverse real and simulated tasks. Empirical results show VIP and VIRT outperform strong baselines and enable complex capabilities, demonstrating a scalable vision-first approach to robotic pre-training and instruction following.

Abstract

The effectiveness of scaling up training data in robotic manipulation is still limited. A primary challenge in manipulation is that tasks are diverse, and a trained policy becomes confused if the task targets are not specified clearly. Existing works primarily rely on text instructions to describe targets. However, we reveal that current robotic data cannot train policies to understand text instructions effectively, and that vision is much more comprehensible. Therefore, we propose using vision instructions to specify targets. A straightforward implementation is to train a policy to predict the intermediate actions linking the current observation and a future image. Nevertheless, a single future image does not describe the task target in sufficient detail. To handle this problem, we propose to use sparse point flows to provide more detailed information. Extensive tasks are designed based on real and simulated environments to evaluate the effectiveness of our vision instructed pre-training (VIP) method. The results indicate that VIP significantly improves performance on diverse tasks, and the derived policy can complete competitive tasks like "opening the lid of a tightly sealed bottle".

Paper Structure

This paper contains 12 sections, 1 equation, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Visual comparison of the action attention maps of the text instructed policy and the vision instructed policy. The text instructed policy is confused about which region to concentrate on, while the vision instructed policy focuses on the target correctly. This phenomenon suggests that vision instruction is more comprehensible to policy networks.
  • Figure 2: Overall pipeline of VIP. The input to the pre-trained policy includes two image frames (the observation frame and future frame) and sparse point flows, which describe the changing dynamics of the scene. The sparse point flows are gradually removed by the progressive mask module during pre-training.
  • Figure 3: Conceptual diagram of sparse point flow. Consecutive frames in a video comprise numerous pixels and contain much redundant information for describing the movement of a robot hand. By contrast, a small group of points tracking moving pixels, namely sparse point flows, is a much more efficient description.
  • Figure 4: Visualization of different vision instructions. The three columns of images in the first and second rows show the world model input, future ground truth, and future image prediction in simulated and real scenarios. The third row illustrates the cropped image of the object to manipulate.
  • Figure 5: Illustrations of the Cobot Magic robot and how it is teleoperated. The robot has two master arms and two puppet arms.
  • ...and 4 more figures
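
The Figure 2 caption says that the sparse point flows are gradually removed by a progressive mask module during pre-training. The sketch below illustrates one way such a schedule could work. It is not the authors' implementation: the linear schedule, the function names `mask_ratio` and `progressive_mask`, and the representation of a flow as a list of (x, y) positions are all assumptions made for illustration.

```python
import random


def mask_ratio(step, total_steps):
    """Fraction of point tracks to hide at a given training step.

    Assumption: a linear ramp from 0.0 (all flows visible) to 1.0
    (all flows hidden). The source does not specify the schedule.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def progressive_mask(flows, step, total_steps, rng=None):
    """Return a copy of `flows` with a growing share of tracks set to None.

    flows: list of point tracks, each a list of (x, y) positions over time
           (a hypothetical representation of sparse point flows).
    """
    rng = rng or random.Random()
    n_masked = round(mask_ratio(step, total_steps) * len(flows))
    masked_ids = set(rng.sample(range(len(flows)), n_masked))
    return [None if i in masked_ids else track for i, track in enumerate(flows)]


# Usage: eight toy tracks, inspected early, midway, and at the end of training.
tracks = [[(i, 0), (i, 1), (i, 2)] for i in range(8)]
for step in (0, 50, 100):
    visible = [t for t in progressive_mask(tracks, step, 100) if t is not None]
    print(step, len(visible))
```

Under this assumed schedule the policy sees every flow at the start of pre-training and none at the end, which matches the stated goal of aligning training with an inference setting where flows are unavailable.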