Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers

Lirui Wang, Xinlei Chen, Jialiang Zhao, Kaiming He

TL;DR

Robotic policies suffer from hardware and environment heterogeneity, which makes cross-embodiment generalization difficult. The authors introduce Heterogeneous Pre-trained Transformers (HPT), which use embodiment-specific stems, a shared trunk, and task-specific heads to learn a shared latent representation from real robots, simulators, and human videos, enabling transfer to unseen embodiments with minimal new data. Large-scale pre-training across 52 heterogeneous datasets yields scalable improvements, with fine-tuned policy performance improving by over 20% on unseen tasks in multiple simulator benchmarks and real-world settings. This work advances scalable robotic foundation models by demonstrating cross-embodiment transfer and providing open-source code and weights for community use.

Abstract

One of the roadblocks for training generalist robotic models today is heterogeneity. Previous robot learning methods often collect data to train with one specific embodiment for one task, which is expensive and prone to overfitting. This work studies the problem of learning policy representations through heterogeneous pre-training on robot data across different embodiments and tasks at scale. We propose Heterogeneous Pre-trained Transformers (HPT), which pre-train a large, shareable trunk of a policy neural network to learn a task and embodiment agnostic shared representation. This general architecture aligns the specific proprioception and vision inputs from distinct embodiments to a short sequence of tokens and then processes such tokens to map to control robots for different tasks. Leveraging the recent large-scale multi-embodiment real-world robotic datasets as well as simulation, deployed robots, and human video datasets, we investigate pre-training policies across heterogeneity. We conduct experiments to investigate the scaling behaviors of training objectives, to the extent of 52 datasets. HPTs outperform several baselines and enhance the fine-tuned policy performance by over 20% on unseen tasks in multiple simulator benchmarks and real-world settings. See the project website (https://liruiw.github.io/hpt/) for code and videos.
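
The stem/trunk/head split described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a pure-Python toy under stated assumptions. The embodiment names, the latent width of 8, the input dimensions, and the single shared linear map standing in for the Transformer trunk are all hypothetical. The sketch only shows how attention pooling with a fixed set of learnable query tokens turns inputs of different sizes and lengths into the same short token sequence.

```python
import math
import random

NUM_TOKENS = 16  # tokens per modality, the example value given in the figure captions
EMBED_DIM = 8    # hypothetical latent width, kept tiny for illustration


def rand_matrix(rows, cols, rng):
    """Random matrix as a list of rows (stands in for learned weights or input data)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


class Stem:
    """Embodiment-specific tokenizer: project the input, then pool it with learnable queries."""

    def __init__(self, input_dim, rng):
        # A single linear projection; an assumption standing in for an MLP or visual encoder output.
        self.proj = rand_matrix(input_dim, EMBED_DIM, rng)
        self.queries = rand_matrix(NUM_TOKENS, EMBED_DIM, rng)  # learnable tokens

    def __call__(self, features):
        keys = matmul(features, self.proj)              # (T, D), T may vary per input
        scores = matmul(self.queries, transpose(keys))  # (16, T)
        scale = math.sqrt(EMBED_DIM)
        weights = [softmax([s / scale for s in row]) for row in scores]
        return matmul(weights, keys)                    # always (16, D)


class Head:
    """Embodiment-specific head: mean-pool the tokens and project to the action dimension."""

    def __init__(self, action_dim, rng):
        self.out = rand_matrix(EMBED_DIM, action_dim, rng)

    def __call__(self, tokens):
        pooled = [sum(col) / len(tokens) for col in zip(*tokens)]
        return matmul([pooled], self.out)[0]


rng = random.Random(0)

# Shared across all embodiments. A linear map is only a placeholder for the Transformer trunk.
trunk = rand_matrix(EMBED_DIM, EMBED_DIM, rng)

# Hypothetical embodiments with different proprioception, vision, and action dimensions.
embodiments = {
    "arm_7dof": {"proprio": Stem(7, rng), "vision": Stem(32, rng), "head": Head(7, rng)},
    "mobile_base": {"proprio": Stem(3, rng), "vision": Stem(64, rng), "head": Head(2, rng)},
}


def act(name, proprio_seq, vision_seq):
    parts = embodiments[name]  # the "switch": one stem/head pair is active per embodiment
    tokens = parts["proprio"](proprio_seq) + parts["vision"](vision_seq)  # 32 tokens
    latent = matmul(tokens, trunk)  # same trunk weights regardless of embodiment
    return parts["head"](latent)


arm_action = act("arm_7dof", rand_matrix(5, 7, rng), rand_matrix(49, 32, rng))
base_action = act("mobile_base", rand_matrix(2, 3, rng), rand_matrix(10, 64, rng))
print(len(arm_action), len(base_action))  # prints: 7 2
```

Both embodiments reach the trunk as 32 tokens of the same width even though their raw inputs differ in dimension and sequence length, which is what lets a single trunk be trained on the union of datasets.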

Paper Structure

This paper contains 42 sections, 1 equation, 19 figures, and 5 tables.

Figures (19)

  • Figure 1: The Heterogeneous Pre-training concept. It maps different embodiments, each with its own proprioception and vision sensors, onto a shared latent space by embodiment-specific tokenizers ("stems"). This aligns the heterogeneous data from different embodiments into a joint representation space. This allows us to train a shared Transformer trunk on the union of all heterogeneous datasets. The pre-trained Transformer can be transferred to a new embodiment, with a small, new tokenizer learned at transferring time.
  • Figure 2: HPT architecture. HPT is modularized into stems, trunk, and heads. The stem, consisting of a proprioception tokenizer and a vision tokenizer, maps the vision and proprioception observations of different embodiments to a fixed number (e.g. 16) of tokens. The shared trunk, which is a Transformer, maps the concatenated tokens into shared representations. The head then maps the processed tokens to actions in different downstream tasks. For a specific embodiment, one stem/head pair is activated (denoted by the switch). The trunk is shared and pre-trained on action-labeled data with supervised learning and then transferred to new embodiments. This procedure scales up to 52 datasets and 1B parameters.
  • Figure 3: Stem Architecture in HPT. In the HPT stem, the proprioceptive tokenizer uses an MLP to map proprioceptive information to a feature which is then attended by 16 learnable tokens. The vision tokenizer uses pre-trained encoders and similarly uses an attention mechanism to map vision features into 16 fixed tokens. The architecture flexibly handles sequences of inputs without increasing the size of tokens.
  • Figure 4: Dataset Heterogeneity in Robotics. We show illustrations of dataset mixtures (each color is a distinct embodiment) from different domains including real robot teleop [open_x_embodiment_rt_x_2023], deployed robots [frodobots2024frodobots2k], simulations, and human videos [Damen2018EPICKITCHENS]. See the implementation appendix for dataset mixture details.
  • Figure 5: Data Scaling. We run scaling HPT experiments along dataset sizes and the number of datasets. Each point is the validation loss of a full training run. (a) We evaluate the losses on 27 datasets with the number of total trajectories ranging from a maximum of 10 trajectories per dataset (270 in total) to a maximum of 100,000 trajectories per dataset (170k in total). We compare two model sizes, HPT-S/L, where HPT-L is a bigger model trained with 4 times more tokens than HPT-S. (b) We compute the validation losses for a fixed subset of 10 datasets with a fixed number of epochs (2). We compute mean and standard deviations for 4 runs across model sizes from HPT-S to HPT-XL and across dataset counts from 10 to 52.
  • ...and 14 more figures
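
The trajectory totals in the Figure 5 caption follow from a per-dataset cap: with 27 datasets and at most 10 trajectories each, every dataset hits the cap and the total is 270, while a cap of 100,000 yields far fewer than 27 × 100,000 because most datasets are smaller than the cap. A minimal sketch of that arithmetic, using made-up dataset sizes (the real sizes are not given here, so these totals will not match the paper's 170k):

```python
def capped_total(dataset_sizes, cap):
    """Total trajectories when each dataset contributes at most `cap` trajectories."""
    return sum(min(size, cap) for size in dataset_sizes)


# Hypothetical sizes for 27 datasets; illustrative only, not the paper's data.
sizes = [50] * 9 + [2_000] * 9 + [40_000] * 9

print(capped_total(sizes, 10))       # 270: every dataset is truncated to the cap
print(capped_total(sizes, 1_000))    # 18450: small datasets contribute all they have
print(capped_total(sizes, 100_000))  # 378450: no dataset reaches the cap
```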