In-N-On: Scaling Egocentric Manipulation with in-the-wild and on-task Data

Xiongyi Cai, Ri-Zhao Qiu, Geng Chen, Lai Wei, Isabella Liu, Tianshu Huang, Xuxin Cheng, Xiaolong Wang

TL;DR

The paper tackles robust real-world manipulation by leveraging large-scale egocentric human data through a two-stage training pipeline: pre-training on diverse in-the-wild and on-task data to build a unified, language-grounded representation, followed by post-training on task-aligned demonstrations. It introduces PHSD, a large-scale dataset that enables retargeting human demonstrations to humanoid embodiments, and Human0, a language-conditioned flow-matching base model that uses domain-adversarial learning to bridge the embodiment gap. Empirical results on real Unitree humanoids demonstrate zero-shot language following, one-shot robot learning, and improved robustness, with ablations confirming the value of domain adaptation and the data-mix strategy. The work provides a scalable recipe for egocentric manipulation data collection and training, with potential for broader embodiment coverage and practical deployment in industry settings.

Abstract

Egocentric videos are a valuable and scalable data source for learning manipulation policies. However, due to significant data heterogeneity, most existing approaches use human data only for simple pre-training, which does not unlock its full potential. This paper provides a scalable recipe for collecting and using egocentric data by dividing human data into two categories, in-the-wild and on-task, along with a systematic analysis of how to use each. We first curate a dataset, PHSD, which contains over 1,000 hours of diverse in-the-wild egocentric data and over 20 hours of on-task data directly aligned to the target manipulation tasks. This enables learning a large egocentric language-conditioned flow matching policy, Human0. With domain adaptation techniques, Human0 minimizes the gap between humans and humanoids. Empirically, we show that Human0 achieves several novel properties from scaling human data, including following language instructions learned from human data only, few-shot learning, and improved robustness using on-task data. Project website: https://xiongyicai.github.io/In-N-On/

Paper Structure

This paper contains 21 sections, 6 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Method overview. Our approach follows a two-stage training recipe: (1) pre-training on large-scale in-the-wild human and robot data that are mapped into a unified human-centric state-action space; and (2) on-task post-training using task-aligned human and robot demonstrations. To bridge the embodiment gap, we employ a domain-adversarial discriminator that takes SigLIP visual features and action-state embeddings as input and predicts whether a sample comes from human or robot data. Through gradient reversal, this encourages the policy's encoders to produce embodiment-invariant representations, enabling effective transfer between human and robot observations.
  • Figure 2: Our retargeting software suite supports retargeting different humanoids to and from the human-centric representation. The figure shows the same human action retargeted to different humanoids in MuJoCo (Todorov et al., 2012). The code will be released.
  • Figure 3: Data distributions and sampling factors for pre-training and post-training.
  • Figure 4: Confusion matrix obtained by linear probing of intermediate features from the vanilla model.
  • Figure 5: We task the robot with several manipulation tasks to evaluate few-shot learning, language instruction following, and robustness using on-task human data. Videos are in the supplementary material. (Top to bottom: burger assembly, pouring, and multi-object grasping.)
  • ...and 6 more figures
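
The gradient-reversal idea mentioned in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the element-wise linear encoder, the logistic discriminator, and every name below (`domain_adversarial_grads`, `grad_reverse_backward`, `lam`) are hypothetical stand-ins chosen so the gradients can be written out by hand without a deep learning framework. The sketch only shows the core mechanism: the discriminator receives the ordinary gradient of the human-vs-robot classification loss, while the encoder receives that gradient with its sign flipped, which pushes the encoder toward features the discriminator cannot separate.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def grad_reverse_backward(upstream_grad, lam):
    """Backward rule of a gradient reversal layer.

    The forward pass is the identity; the backward pass multiplies the
    incoming gradient by -lam (lam is an assumed scaling hyperparameter).
    """
    return [-lam * g for g in upstream_grad]


def domain_adversarial_grads(x, w_enc, w_disc, is_robot, lam=1.0):
    """Return (loss, grad_disc, grad_enc) for one sample.

    x        : raw input features (toy stand-in for an observation)
    w_enc    : encoder weights, applied element-wise (hypothetical encoder)
    w_disc   : weights of a logistic human-vs-robot discriminator
    is_robot : domain label, 1.0 for robot data and 0.0 for human data
    """
    # Encoder forward pass: feature_i = w_enc_i * x_i.
    feat = [we * xi for we, xi in zip(w_enc, x)]

    # Gradient reversal layer, forward pass: identity, so feat is unchanged.
    logit = sum(wd * f for wd, f in zip(w_disc, feat))
    p = sigmoid(logit)

    # Binary cross-entropy domain classification loss.
    loss = -(is_robot * math.log(p) + (1.0 - is_robot) * math.log(1.0 - p))

    # d(loss)/d(logit) for sigmoid followed by cross-entropy.
    dlogit = p - is_robot

    # The discriminator gets the ordinary gradient (it minimizes the loss).
    grad_disc = [dlogit * f for f in feat]

    # Gradient w.r.t. the features, then reversed before reaching the encoder.
    grad_feat = [dlogit * wd for wd in w_disc]
    grad_feat_reversed = grad_reverse_backward(grad_feat, lam)

    # Chain rule through the element-wise encoder.
    grad_enc = [g * xi for g, xi in zip(grad_feat_reversed, x)]
    return loss, grad_disc, grad_enc


if __name__ == "__main__":
    loss, g_disc, g_enc = domain_adversarial_grads(
        x=[1.0, 2.0], w_enc=[0.5, -0.5], w_disc=[1.0, 1.0], is_robot=1.0
    )
    print("loss:", loss)
    print("discriminator gradient:", g_disc)
    print("encoder gradient (reversed):", g_enc)
```

A gradient-descent step with `grad_disc` makes the discriminator better at telling the two domains apart, while the same kind of step with `grad_enc` makes the encoder's features harder to classify. Setting `lam=0` switches the adversarial signal off for the encoder, which is one common way to ramp the effect in gradually during training.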