PEAC: Unsupervised Pre-training for Cross-Embodiment Reinforcement Learning

Chengyang Ying, Zhongkai Hao, Xinning Zhou, Xuezhou Xu, Hang Su, Xingxing Zhang, Jun Zhu

TL;DR

This work addresses the challenge of generalizing reinforcement learning agents across diverse embodiments by proposing Cross-Embodiment Unsupervised RL (CEURL), a reward-free pre-training paradigm formulated as a Controlled Embodiment Markov Decision Process (CE-MDP). It introduces Pre-trained Embodiment-Aware Control (PEAC), which combines a cross-embodiment intrinsic reward $\mathcal{R}_{\text{CE}}$ with an embodiment discriminator to learn embodiment-aware, task-agnostic representations. Two variants, PEAC-LBS and PEAC-DIAYN, integrate PEAC with existing unsupervised RL methods. Theoretical analysis links the pre-training objective to a tractable KL-based form and shows heightened variability of cross-embodiment skill vertices, which guides the choice of robust initializations for downstream tasks. Experiments across DeepMind Control Suite, Robosuite, Isaacgym, and real-world Aliengo locomotion show that PEAC enables fast adaptation and generalization to unseen embodiments.

Abstract

Designing generalizable agents capable of adapting to diverse embodiments has attracted significant attention in Reinforcement Learning (RL) and is critical for deploying RL agents in various real-world applications. Previous Cross-Embodiment RL approaches have focused on transferring knowledge across embodiments within specific tasks. These methods often yield knowledge that is tightly coupled with those tasks and fail to adequately capture the distinct characteristics of different embodiments. To address this limitation, we introduce the notion of Cross-Embodiment Unsupervised RL (CEURL), which leverages unsupervised learning to enable agents to acquire embodiment-aware and task-agnostic knowledge through online interactions within reward-free environments. We formulate CEURL as a novel Controlled Embodiment Markov Decision Process (CE-MDP) and systematically analyze CEURL's pre-training objectives under CE-MDP. Based on these analyses, we develop a novel algorithm, Pre-trained Embodiment-Aware Control (PEAC), for handling CEURL, incorporating an intrinsic reward function specifically designed for cross-embodiment pre-training. PEAC not only provides an intuitive optimization strategy for cross-embodiment pre-training but also integrates flexibly with existing unsupervised RL methods, facilitating cross-embodiment exploration and skill discovery. Extensive experiments in both simulated (e.g., DMC and Robosuite) and real-world environments (e.g., legged locomotion) demonstrate that PEAC significantly improves adaptation performance and cross-embodiment generalization, confirming its effectiveness in overcoming the unique challenges of CEURL. The project page and code are available at https://yingchengyang.github.io/ceurl.
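To make the idea of a discriminator-based cross-embodiment intrinsic reward concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the text above says only that PEAC uses an intrinsic reward built on an embodiment discriminator and tied to a KL-based objective. The specific form used here, $\log p(e) - \log q(e \mid s)$, and the names `cross_embodiment_reward`, `logits`, and `log_prior` are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_embodiment_reward(logits, embodiment_ids, log_prior):
    """Assumed reward form: log p(e) - log q(e | s).

    logits:         (batch, n_embodiments) outputs of a hypothetical
                    embodiment discriminator q(e | s)
    embodiment_ids: (batch,) integer index of the embodiment that
                    generated each sample
    log_prior:      (n_embodiments,) log of the embodiment prior p(e)
    """
    log_q = log_softmax(logits)
    rows = np.arange(len(embodiment_ids))
    # The reward is low where the discriminator identifies the
    # embodiment confidently, and zero where it matches the prior.
    return log_prior[embodiment_ids] - log_q[rows, embodiment_ids]

# Three embodiments with a uniform prior.
log_prior = np.log(np.full(3, 1.0 / 3.0))
uninformed = cross_embodiment_reward(np.zeros((2, 3)), np.array([0, 2]), log_prior)
confident = cross_embodiment_reward(np.array([[10.0, 0.0, 0.0]]), np.array([0]), log_prior)
```

Under this assumed form, a discriminator that cannot tell embodiments apart yields zero reward, while a confident, correct discriminator yields a negative reward. In practice the reward would be computed on batches from a replay buffer while the discriminator is trained with a standard classification loss; those details are outside what the summary states.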

Paper Structure

This paper contains 57 sections, 1 theorem, 37 equations, 13 figures, 21 tables, 3 algorithms.

Key Result

Theorem 3.2

The pre-training objective Eq. eq_222 of $(\pi,\mathcal{E})$ satisfies

Figures (13)

  • Figure 1: Overview of Cross-Embodiment Unsupervised Reinforcement Learning (CEURL). The left subfigure illustrates the cross-embodiment setting with various possible embodiment changes. Directly training RL agents across embodiments under given tasks may result in task-aware rather than embodiment-aware knowledge. CEURL pre-trains agents in reward-free environments to extract embodiment-aware knowledge. The center subfigure shows the Pre-trained Embodiment-Aware Control (PEAC) algorithm, using our cross-embodiment intrinsic reward function $\mathcal{R}_{\text{CE}}(\tau)$. The right subfigure demonstrates the fine-tuning phase, where pre-trained agents fast adapt to different downstream tasks, improving adaptation and generalization.
  • Figure 1: Results of Robosuite and Isaacgym.
  • Figure 2: Benchmark environments, including DMC [tassa2018deepmind], Robosuite [zhu2020robosuite], and Isaacgym [makoviychuk2021isaac].
  • Figure 3: Aggregate metrics [agarwal2021deep] in state-based DMC. Each statistic for every algorithm is computed over 120 runs (3 embodiment settings $\times$ 4 downstream tasks $\times$ 10 seeds).
  • Figure 4: Aggregate metrics [agarwal2021deep] in image-based DMC. Each statistic for every algorithm is computed over 36 runs (3 embodiment settings $\times$ 4 downstream tasks $\times$ 3 seeds).
  • ...and 8 more figures
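The aggregate-metric captions above pool runs across embodiment settings, tasks, and seeds. As a rough illustration of that pooling, the sketch below computes an interquartile mean (IQM), one of the aggregate statistics proposed in the cited work agarwal2021deep. The captions do not say which statistics the figures report, so the choice of IQM, the function name `interquartile_mean`, and the synthetic scores are all assumptions.

```python
import numpy as np

def interquartile_mean(scores):
    """Mean of the middle 50% of all runs, pooled over every axis."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    n = flat.size
    lo, hi = n // 4, n - n // 4  # drop the bottom and top quarter
    return flat[lo:hi].mean()

# Synthetic placeholder scores shaped like the state-based DMC setup:
# 3 embodiment settings x 4 downstream tasks x 10 seeds = 120 runs.
scores = np.arange(120, dtype=float).reshape(3, 4, 10)
iqm = interquartile_mean(scores)

# A single extreme run does not move the IQM.
robust = interquartile_mean([0, 1, 2, 3, 4, 5, 6, 1000])
```

Because the IQM discards the lowest and highest quarter of runs, it is less sensitive to outlier seeds than the mean while using more of the data than the median.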

Theorems & Definitions (3)

  • Definition 3.1: Controlled Embodiment MDP (CE-MDP)
  • Theorem 3.2: Proof in Appendix \ref{app_proof_thm1}
  • proof