Polybot: Training One Policy Across Robots While Embracing Variability

Jonathan Yang, Dorsa Sadigh, Chelsea Finn

TL;DR

Polybot tackles cross-robot generalization in vision-based manipulation by reusing datasets across diverse robotic embodiments. It aligns the observation space with wrist-mounted cameras, the action space with a shared higher-level control environment and robot-specific heads, and the policy's internal representations with contrastive pretraining, enabling zero-shot and few-shot transfer across multiple robots. The approach improves success rates over baselines, including on 6-DoF tasks and shelf-manipulation transfer, which suggests it can reduce the data collection effort needed for a new robot. The work offers a concrete pathway for scaling robotic learning across heterogeneous hardware.

Abstract

Reusing large datasets is crucial for scaling vision-based robotic manipulation to everyday scenarios, because collecting robotic datasets is costly. However, robotic platforms differ in control schemes, camera viewpoints, kinematic configurations, and end-effector morphologies, which poses significant challenges when transferring manipulation skills from one platform to another. To tackle this problem, we propose a set of key design decisions for training a single policy that can be deployed on multiple robotic platforms. Our framework first aligns the observation and action spaces of our policy across embodiments by using wrist cameras and a unified but modular codebase. To bridge the remaining domain shift, we align our policy's internal representations across embodiments through contrastive learning. We evaluate our method on a dataset collected over 60 hours spanning 6 tasks and 3 robots with varying joint configurations and sizes: the WidowX 250S, the Franka Emika Panda, and the Sawyer. Our results demonstrate significant improvements in success rate and sample efficiency for our policy when using new task data collected on a different robot, validating our proposed design decisions. More details and videos can be found on our anonymized project website: https://sites.google.com/view/polybot-multirobot

Paper Structure (18 sections, 2 equations, 11 figures, 6 tables)


Figures (11)

  • Figure 1: Our framework for generalization across multiple robots. We first standardize our observation space using front-mounted wrist cameras and our action space using a shared higher-level control environment. We then align our policy's internal representations using contrastive learning, and finally finetune these representations to learn robot-specific dynamics.
  • Figure 2: Internal Representation Alignment. This figure depicts two trajectories on different robots for the same task. Our contrastive pretraining approach maps together observations whose proprioceptive state relative to the grasped book and cabinet is similar. The orange lines represent example pairs of observations mapped together, while the red line represents an example pair of observations whose embeddings are pushed apart.
  • Figure 3: Pick/Place Tasks. The left column contains the shared pick/place task, while the other columns contain the new distractor and new object variants. For zero-shot evaluation, we include data for the shared task across all robots. For few-shot evaluation, we also include $5$ demonstrations of a variant for one robot.
  • Figure 4: Our robotic setups. For each robot, we collect data with both a wrist camera and an exterior camera. The cameras are Logitech C920s and Zeds. Although these cameras differ slightly in brightness and contrast, this does not seem to affect results.
  • Figure 5: Our encoder architecture. We parameterize our encoder as a CNN. The output of the convolutional layers is flattened and then fed into two MLP layers to obtain a representation $z$. To learn correspondence between robots, we train this encoder with a contrastive loss. We use random crop and color jitter as image augmentations for our encoder.
  • ...and 6 more figures
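
The contrastive alignment described in Figures 2 and 5 can be illustrated with a small sketch. The captions do not specify the exact loss, so the InfoNCE objective below is an assumed stand-in, not the authors' formulation. The function name `info_nce_loss`, the `temperature` value, and the random embeddings are all hypothetical. Row i of each input plays the role of the representation $z$ of two observations, one per robot, that should be mapped together; all other rows act as pairs to be pushed apart.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE loss (assumed stand-in for the paper's contrastive loss).

    Row i of z_a and row i of z_b form a positive pair; every other row
    of z_b serves as a negative for row i of z_a.
    """
    # L2-normalise so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                    # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    # Row-wise log-softmax; the diagonal holds the positive pairs.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Hypothetical embeddings: 8 observations per robot, 16-dimensional z.
rng = np.random.default_rng(0)
z_robot_a = rng.normal(size=(8, 16))
z_aligned = z_robot_a + 0.01 * rng.normal(size=(8, 16))   # matching states
z_unrelated = rng.normal(size=(8, 16))                     # no correspondence

loss_aligned = info_nce_loss(z_robot_a, z_aligned)
loss_unrelated = info_nce_loss(z_robot_a, z_unrelated)
```

In this toy setting, `loss_aligned` comes out lower than `loss_unrelated`, which is the behaviour a contrastive objective rewards: embeddings of corresponding observations across robots end up close, and non-corresponding ones end up far apart.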