RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation

Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li

TL;DR

The paper tackles the data scarcity barrier for Vision-Language-Action (VLA) models in robotics by introducing RynnVLA-001, which is trained with a three-stage curriculum: ego-centric video generative pretraining on large-scale human manipulation videos, trajectory-aware modeling with human keypoints, and robot-centric fine-tuning that uses ActionVAE to compress action sequences. The approach transfers manipulation priors from human demonstrations to robot control and, when fine-tuned on the same data, outperforms state-of-the-art baselines on LeRobot SO100 across three tasks. At inference, the model predicts action embeddings rather than full future frames, which speeds up real-time control. The work indicates that a staged curriculum bridging visual dynamics and low-level actions can improve VLA performance in robotics, with ActionVAE providing compact, coherent action representations.

Abstract

This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby effectively bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When finetuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
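
To make the ActionVAE idea concrete, the following is a minimal toy sketch of a variational autoencoder over a short chunk of actions. It is not the paper's implementation: the chunk length, action dimension, latent size, the use of single linear maps, and the untrained random weights are all assumptions chosen for illustration, and the class name ToyActionVAE is hypothetical. It only shows the data flow the abstract describes: a sequence of actions is encoded into a compact latent embedding, sampled with the reparameterization trick, and decoded back into an action sequence.

```python
import math
import random

# Illustrative sizes (assumptions, not values from the paper).
CHUNK_LEN = 4    # number of actions compressed together
ACTION_DIM = 3   # dimensions per action
LATENT_DIM = 2   # size of the compact latent embedding


def make_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToyActionVAE:
    """Hypothetical linear VAE over a flattened action chunk."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        flat = CHUNK_LEN * ACTION_DIM
        self.w_mu = make_matrix(LATENT_DIM, flat, self.rng)
        self.w_logvar = make_matrix(LATENT_DIM, flat, self.rng)
        self.w_dec = make_matrix(flat, LATENT_DIM, self.rng)

    def encode(self, chunk):
        """Map a chunk (CHUNK_LEN x ACTION_DIM) to latent mean and log-variance."""
        flat = [v for action in chunk for v in action]
        return matvec(self.w_mu, flat), matvec(self.w_logvar, flat)

    def sample(self, mu, logvar):
        """Reparameterization: z = mu + sigma * eps with eps ~ N(0, 1)."""
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                for m, lv in zip(mu, logvar)]

    def decode(self, z):
        """Map a latent embedding back to a chunk of actions."""
        flat = matvec(self.w_dec, z)
        return [flat[i:i + ACTION_DIM] for i in range(0, len(flat), ACTION_DIM)]


def kl_to_standard_normal(mu, logvar):
    """KL divergence from N(mu, sigma^2) to N(0, 1), summed over latent dims."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


vae = ToyActionVAE()
chunk = [[0.1 * t, 0.0, -0.1 * t] for t in range(CHUNK_LEN)]
mu, logvar = vae.encode(chunk)
z = vae.sample(mu, logvar)
recon = vae.decode(z)
kl = kl_to_standard_normal(mu, logvar)
```

In this sketch the policy would only need to predict the LATENT_DIM numbers in z instead of all CHUNK_LEN x ACTION_DIM action values, which is the sense in which a latent action embedding reduces the size of the output space. A real model would train the encoder and decoder with a reconstruction loss plus the KL term.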

Paper Structure

This paper contains 18 sections, 3 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Training data pipeline of RynnVLA-001. Our framework leverages three types of training data: (1) Ego-Centric Video Generative Pretraining uses millions of ego-centric human manipulation videos for future frame prediction. (2) Human-Centric Trajectory-Aware Video Modeling trains on videos with human keypoint annotations, enabling joint prediction of frames and trajectories. (3) Robot-Centric Vision-Language-Action Modeling employs robot datasets paired with language instructions to learn mappings from visual observations and language to robotic actions.
  • Figure 2: Model architecture and training stages of RynnVLA-001. The training consists of three stages: (1) Ego-Centric Video Generative Pretraining trains a transformer-based Image-to-Video (I2V) model for future frame prediction. (2) Human-Centric Trajectory-Aware Video Modeling extends the I2V model with action (trajectory) prediction heads, incorporating both visual and state embeddings (blue blocks). (3) Robot-Centric Vision-Language-Action Modeling transfers pretrained weights to robot data, where the model generates action embeddings decoded by ActionVAE into executable actions.
  • Figure 3: Illustration of Evaluation Tasks. We evaluate the performance of VLA models on three tasks: (1) pick up and place green blocks, (2) pick up and place strawberries, and (3) grab pen and put it into holder. Each task is evaluated under three settings: single-target manipulation, multi-target manipulation (first three images), and instruction-following with distractors (rightmost image).
  • Figure 4: Visualization of Video Generative Pretraining. Given an input image and a text prompt, an I2V model is trained to predict the next 7 frames. Our pretrained video generation model is capable of generating plausible motions while maintaining consistency with the input image.
  • Figure 5: Analysis of the front camera's function for coarse localization. (a) Under normal dual-camera settings, the robot successfully picks the strawberries. (b) The front camera is masked, leaving only the wrist camera functional. (c) The robot can still complete the task if the target is within the wrist camera's initial field of view. However, the task success rate drops from 80% (4/5) to 0% when the target is outside the wrist camera's view (on the left side), demonstrating that the front camera is essential for guiding the robot to the target's coarse location.
  • ...and 1 more figure