METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model

Yankai Fu, Ning Chen, Junkai Zhao, Shaozhe Shan, Guocai Yao, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang

TL;DR

METIS tackles dexterous manipulation by pretraining a vision-language-action model on EgoAtlas, a multi-source egocentric dataset that aligns human and robotic trajectories under a unified wrist/fingertip action space. It introduces motion-aware dynamics tokens, discretizing visual and hand motion with a VQ-VAE and a residual quantization scheme to enable autoregressive action generation with an extended tokenizer that supports $N=44$ dynamic tokens (4 visual, 40 motion). The model integrates reasoning and acting via chain-of-thought–style prompting with [BOA] and [BOD] tokens, achieving state-of-the-art average success across six real-world tasks and strong generalization to unseen backgrounds, lighting, objects, clutter, and higher-DoF embodiments. These results show that large-scale, diverse egocentric data paired with compact motion priors can substantially improve dexterous, multimodal manipulation in embodied AI.
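The residual quantization idea mentioned above can be illustrated with a minimal sketch. This is not the METIS tokenizer: the codebooks here are random rather than learned by a VQ-VAE, and the vector dimension, codebook size, number of levels, `base_vocab_size`, and the helper names `residual_quantize` and `to_vocab_ids` are all hypothetical choices made for illustration. The sketch only shows how a continuous feature vector can be turned into a short sequence of discrete indices, and how those indices could be mapped to IDs appended after an existing vocabulary so that an autoregressive model can predict them as tokens.

```python
import numpy as np


def residual_quantize(x, codebooks):
    """Quantize vector x with a stack of codebooks (residual quantization).

    Each level picks the code nearest to the current residual and then
    subtracts it, so the next level encodes what is left over.
    Returns the chosen indices and the summed reconstruction.
    """
    residual = np.asarray(x, dtype=np.float64).copy()
    recon = np.zeros_like(residual)
    indices = []
    for cb in codebooks:  # each cb has shape (num_codes, dim)
        dists = np.sum((cb - residual) ** 2, axis=1)
        k = int(np.argmin(dists))
        indices.append(k)
        recon += cb[k]
        residual -= cb[k]
    return indices, recon


def to_vocab_ids(indices, base_vocab_size, num_codes):
    """Map per-level code indices to IDs placed after a base vocabulary.

    Each level gets its own contiguous ID range (an assumed layout).
    """
    return [base_vocab_size + level * num_codes + k
            for level, k in enumerate(indices)]


# Illustrative sizes only; none of these values come from the paper.
rng = np.random.default_rng(0)
dim, num_codes, levels = 8, 16, 4
codebooks = [rng.normal(scale=0.5 ** lvl, size=(num_codes, dim))
             for lvl in range(levels)]

x = rng.normal(size=dim)  # stand-in for an encoded motion feature
indices, recon = residual_quantize(x, codebooks)
token_ids = to_vocab_ids(indices, base_vocab_size=32000, num_codes=num_codes)
```

In this layout a feature vector costs one token per quantization level, so the token budget per step is fixed by the number of features and levels rather than by the raw dimensionality of the motion signal.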

Abstract

Building a generalist robot that can perceive, reason, and act across diverse tasks remains an open challenge, especially for dexterous manipulation. A major bottleneck is the scarcity of large-scale, action-annotated data for dexterous skills, as teleoperation is difficult and costly. Human data, with its vast scale and diverse manipulation behaviors, provides rich priors for learning robotic actions. While prior works have explored leveraging human demonstrations, they are often constrained by limited scenarios and a large visual gap between humans and robots. To address these limitations, we propose METIS, a vision-language-action (VLA) model for dexterous manipulation pretrained on multi-source egocentric datasets. We first construct EgoAtlas, which integrates large-scale human and robotic data from multiple sources, all unified under a consistent action space. We further extract motion-aware dynamics, a compact and discretized motion representation that provides efficient and expressive supervision for VLA training. Building on these, METIS integrates reasoning and acting into a unified framework, enabling effective deployment to downstream dexterous manipulation tasks. Our method demonstrates exceptional dexterous manipulation capabilities, achieving the highest average success rate across six real-world tasks. Experimental results also highlight its superior generalization and robustness in out-of-distribution scenarios. These findings position METIS as a promising step toward a generalist model for dexterous manipulation.

Paper Structure

This paper contains 28 sections, 7 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: METIS is trained on EgoAtlas, a multi-source egocentric manipulation dataset. It leverages motion-aware dynamics to extract manipulation-relevant dexterous features, and integrates reasoning and acting within a unified framework. METIS achieves strong performance across diverse dexterous manipulation tasks and exhibits remarkable generalization capability.
  • Figure 2: Wearable Hand Motion Collection System.
  • Figure 3: Overview of the framework. (a) We construct an expressive yet compact representation to capture the dynamics involved in dexterous manipulation. (b) METIS is pretrained on the multi-source EgoAtlas dataset, where human and robot actions are aligned under a unified action space. (c) METIS integrates reasoning and acting within a framework, enabling effective deployment to downstream dexterous tasks.
  • Figure 4: Visualization of dexterous manipulation tasks, including three short-horizon and three long-horizon tasks.
  • Figure 5: Instruction following results. For each task, 100 demonstrations are collected; the tasks are trained jointly and evaluated using different language instructions.
  • ...and 5 more figures