See Once, Then Act: Vision-Language-Action Model with Task Learning from One-Shot Video Demonstrations

Guangyan Chen, Meiling Wang, Qi Shao, Zichen Zhou, Weixin Mao, Te Cui, Minzhao Zhu, Yinan Deng, Luojie Yang, Zhanqi Zhang, Yi Yang, Hua Chen, Yufeng Yue

TL;DR

ViVLA addresses one-shot learning of unseen robotic manipulation tasks by conditioning policy predictions on a single expert video and the robot’s observations. It introduces a latent action tokenizer with action-centric cycle consistency and a parallel-decoding VLA framework built on a vision-language backbone, enabling cross-embodiment transfer without fine-tuning. A scalable video-driven data-generation pipeline yields 892,911 expert-agent pairs from human videos and public datasets, supporting generalization to unseen tasks, cross-robot transfer, and real-world learning from human videos. Empirically, ViVLA achieves improvements of over 30% on unseen LIBERO tasks, over 35% with cross-embodiment videos, and over 38% on unseen real-world tasks learned from human videos, indicating that it distills knowledge from demonstrations across embodiments.

Abstract

Developing robust and general-purpose manipulation policies represents a fundamental objective in robotics research. While Vision-Language-Action (VLA) models have demonstrated promising capabilities for end-to-end robot control, existing approaches still exhibit limited generalization to tasks beyond their training distributions. In contrast, humans possess remarkable proficiency in acquiring novel skills by simply observing others performing them once. Inspired by this capability, we propose ViVLA, a generalist robotic manipulation policy that achieves efficient task learning from a single expert demonstration video at test time. Our approach jointly processes an expert demonstration video alongside the robot's visual observations to predict both the demonstrated action sequences and subsequent robot actions, effectively distilling fine-grained manipulation knowledge from expert behavior and transferring it seamlessly to the agent. To enhance the performance of ViVLA, we develop a scalable expert-agent pair data generation pipeline capable of synthesizing paired trajectories from easily accessible human videos, further augmented by curated pairs from publicly available datasets. This pipeline produces a total of 892,911 expert-agent samples for training ViVLA. Experimental results demonstrate that our ViVLA is able to acquire novel manipulation skills from only a single expert demonstration video at test time. Our approach achieves over 30% improvement on unseen LIBERO tasks and maintains above 35% gains with cross-embodiment videos. Real-world experiments demonstrate effective learning from human videos, yielding more than 38% improvement on unseen tasks.

Paper Structure

This paper contains 19 sections, 10 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Illustration of our ViVLA. (a) Our ViVLA is trained to predict subsequent robotic actions conditioned on a single expert demonstration video, endowing the model with the capacity to learn novel tasks from a single expert demonstration video at test time. (b) To push the performance limit of our proposed ViVLA, we develop a scalable expert-agent pair data generation pipeline and compile a large-scale expert-agent pair dataset. (c) Extensive experiments demonstrate that our proposed ViVLA efficiently learns unseen tasks and achieves state-of-the-art performance.
  • Figure 2: Overview of our ViVLA. (I) The latent action tokenizer (LAT) learns quantized latent actions from observation sequences, obtaining latent actions for both expert videos and agent demonstrations. (II) The ViVLA model is trained to predict the learned latent action sequences and subsequent robot actions, enabling the robot to acquire novel manipulation skills from only a single expert demonstration video at test time.
  • Figure 3: Motivation for our action-centric cycle consistency. (a) We apply latent actions encoded from reference video frames to the current frame. Existing methods, such as Genie (Bruce et al., 2024), generate frames with divergent motion, revealing limited semantic consistency. (b) We visualize latent action spaces across embodiments, revealing limited cross-embodiment alignment in existing methods. Our method addresses these limitations and constructs a unified latent action space.
  • Figure 4: Illustration of our latent action framework with action-centric cycle consistency. Our approach learns latent action representations from observation frames, while simultaneously introducing action-centric cycle-consistency constraints to establish a unified latent action space.
  • Figure 5: Illustration of our ViVLA. Our approach jointly processes an expert demonstration video alongside the robot's visual observations to predict both the demonstrated action sequences and subsequent robot actions, enabling ViVLA to distill fine-grained manipulation knowledge from expert behavior and transfer it seamlessly to the agent.
  • ...and 7 more figures
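
The Figure 2 caption describes a latent action tokenizer that learns quantized latent actions. As a rough illustration of what "quantized" can mean in that setting, the sketch below shows generic nearest-neighbor vector quantization: each continuous latent vector is mapped to the index of its closest codebook entry. This is a minimal sketch of the general technique, not the paper's tokenizer; the names `quantize` and `tokenize_sequence`, the codebook values, and the two-dimensional latents are all hypothetical.

```python
from typing import List, Sequence


def quantize(latent: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `latent` (squared L2 distance)."""
    if not codebook:
        raise ValueError("codebook must be non-empty")

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # min() over indices keeps the first index on ties.
    return min(range(len(codebook)), key=lambda i: sq_dist(latent, codebook[i]))


def tokenize_sequence(
    latents: Sequence[Sequence[float]], codebook: Sequence[Sequence[float]]
) -> List[int]:
    """Map a sequence of continuous latent vectors to discrete token indices."""
    return [quantize(z, codebook) for z in latents]


# Hypothetical toy codebook with three entries and three per-step latents.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latents = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]
tokens = tokenize_sequence(latents, codebook)
print(tokens)  # [0, 1, 2]
```

In a learned tokenizer the codebook and the encoder producing the latents would be trained jointly; here both are fixed by hand purely to show the lookup step.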