Hardware Conditioned Policies for Multi-Robot Transfer Learning

Tao Chen, Adithyavairavan Murali, Abhinav Gupta

TL;DR

The paper tackles cross-robot transfer learning for DRL policies by introducing Hardware Conditioned Policies, which condition policy decisions on a hardware vector. It presents two encoding schemes: explicit kinematic encoding (HCP-E) and implicit learned embeddings (HCP-I), enabling zero-shot transfer and sample-efficient fine-tuning across diverse robots and tasks. Empirical results in manipulation and locomotion show superior transfer performance, with HCP-I matching or exceeding HCP-E in environments with unknown dynamics. The approach promises practical scalability for deploying unified policies across heterogeneous robotic platforms, with public code and demonstrations available.

Abstract

Deep reinforcement learning could be used to learn dexterous robotic policies, but it is challenging to transfer them to new robots with vastly different hardware properties. It is also prohibitively expensive to learn a new policy from scratch for each robot hardware due to the high sample complexity of modern state-of-the-art algorithms. We propose a novel approach called Hardware Conditioned Policies, where we train a universal policy conditioned on a vector representation of robot hardware. We consider robots in simulation with varied dynamics, kinematic structure, kinematic lengths and degrees-of-freedom. First, we use the kinematic structure directly as the hardware encoding and show strong zero-shot transfer to completely novel robots not seen during training. For robots with a lower zero-shot success rate, we also demonstrate that fine-tuning the policy network is significantly more sample-efficient than training a model from scratch. In tasks where knowing the agent dynamics is important for success, we learn an embedding for robot hardware and show that policies conditioned on the encoding of hardware tend to generalize and transfer well. The code and videos are available on the project webpage: https://sites.google.com/view/robot-transfer-hcp.
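
The core idea in the abstract, a single policy that takes a hardware vector alongside the state, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `make_policy`, the tiny randomly initialized two-layer network, and the three-element link-length vectors are all hypothetical choices made only to show how the state and the hardware vector are concatenated into one input.

```python
import math
import random


def make_policy(state_dim, hw_dim, action_dim, hidden=16, seed=0):
    """Build a tiny, randomly initialized two-layer MLP (illustrative only).

    The network input is the concatenation [state ; hardware vector], so the
    same weights can produce different actions for different robots.
    """
    rng = random.Random(seed)
    in_dim = state_dim + hw_dim
    w1 = [[rng.gauss(0.0, 0.5) for _ in range(in_dim)] for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 0.5) for _ in range(hidden)] for _ in range(action_dim)]

    def policy(state, hw_vec):
        if len(state) != state_dim or len(hw_vec) != hw_dim:
            raise ValueError("state or hardware vector has the wrong length")
        x = list(state) + list(hw_vec)  # condition on hardware by concatenation
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
        # tanh keeps each action component in [-1, 1]
        return [math.tanh(sum(w * hi for w, hi in zip(row, h))) for row in w2]

    return policy


# Hypothetical example: one policy, two robots with different link lengths (metres).
policy = make_policy(state_dim=4, hw_dim=3, action_dim=2)
state = [0.1, -0.2, 0.3, 0.0]
robot_a = [0.30, 0.25, 0.10]
robot_b = [0.45, 0.20, 0.15]
action_a = policy(state, robot_a)
action_b = policy(state, robot_b)
```

In this sketch the two robots share every network weight; only the hardware vector differs, which is what lets a trained policy of this shape be queried for a robot it has not seen before.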

Paper Structure

This paper contains 25 sections, 1 equation, 13 figures, 14 tables, 3 algorithms.

Figures (13)

  • Figure 1: Local coordinate systems for two consecutive joints
  • Figure 2: Robots with different DOF and kinematic structures. The white rings represent joints. There are 4 variants each of the 5-DOF and 6-DOF robots, owing to the different placements of joints.
  • Figure 3: Learning curves for the multi-DOF setup. The training robots comprise Types A-G and Type I (four 5-DOF types, three 6-DOF types, one 7-DOF type). Each type has 140 variants with different dynamics and link lengths. The 100 testing robots used to generate the learning curves are of the same types as the training robots but have different link lengths and dynamics. (a): reacher task with random initial pose and target position. (b): peg insertion with fixed hole position. (c): peg insertion with hole position $(x, y, z)$ randomly sampled in a $0.2\,\mathrm{m}$ box region. Note that the converged success rate in (c) is only about $70\%$. This is because some randomly generated hole positions lie outside a robot's reachable space (workspace), so that robot physically cannot insert the peg into the hole. This is especially common for 5-DOF robots.
  • Figure 4: Testing distance distribution on a real Sawyer robot. A used the policy from Exp. i, B used the policy from Exp. ii, and C used the policy trained with the actual Sawyer CAD model in simulation with randomized dynamics.
  • Figure 5: (a): Distribution (violin plots) of the distance between the peg bottom at the end of the episode and the desired position. The three horizontal lines in each violin plot mark the minimum, the median, and the maximum. The plot shows that HCP-E moves the pegs much closer to the hole than DDPG+HER. (b): The brown curve is the learning curve of training HCP-E from scratch on robot type H with different link lengths and dynamics in the multi-goal setup. The pink curve is the learning curve of training HCP-E on the same robots starting from the pretrained model from Exp. xi. (c): Similar to (b), but the training robots are robot type I (7 DOF) and the pretrained model is from Exp. xv. (b) and (c) show that applying a pretrained model trained on different robot types to a new robot type can accelerate learning by a large margin.
  • ...and 8 more figures