
Cross-domain Random Pre-training with Prototypes for Reinforcement Learning

Xin Liu, Yaran Chen, Haoran Li, Boyu Li, Dongbin Zhao

TL;DR

CRPTpro addresses unsupervised cross-domain RL pre-training in two ways. First, decoupled random collection separates data collection from encoder pre-training. Second, efficient prototypical learning trains a cross-domain encoder $f_\theta$ and prototypes $\{c_j\}_{j=1}^M$ with the self-supervised objective $\mathcal{L}_{SSL}=\mathcal{L}_{comp}+\alpha\mathcal{L}_{intr}$, using Sinkhorn-Knopp and EMA-based targets. The approach achieves state-of-the-art cross-domain downstream RL performance across eight DMControl domains, with a mean expert-normalized score of $1.956$ while using only $54.5\%$ of the wall-clock pre-training time of the next-best baseline, Proto-RL(C). It generalizes to unseen domains and is substantially more efficient to pre-train. Because it trains no exploration agent and relies on a static, diverse cross-domain dataset, CRPTpro reduces the pre-training burden while preserving or improving downstream policy performance. These results indicate that cross-domain random data can be highly informative for representation learning, yielding a generalist encoder for multi-domain RL tasks.
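To make the "Sinkhorn-Knopp targets" concrete, here is a minimal NumPy sketch of the standard Sinkhorn-Knopp normalization as it is commonly used in prototype-based self-supervised learning. It is an illustration under assumptions, not the paper's implementation: the function name `sinkhorn_knopp`, the temperature `epsilon`, the iteration count `n_iters`, and the use of cosine similarities as scores are all choices made for this example.

```python
import numpy as np

def sinkhorn_knopp(scores, n_iters=3, epsilon=0.05):
    """Turn a (batch, M) score matrix into soft cluster-assignment targets.

    Assumes scores are bounded (e.g. cosine similarities in [-1, 1]).
    Each returned row sums to 1; prototype usage is pushed toward uniform.
    """
    scores = np.asarray(scores, dtype=np.float64)
    batch_size, num_prototypes = scores.shape
    # Subtracting the global max avoids overflow and cancels after normalization.
    q = np.exp((scores - scores.max()) / epsilon).T  # shape (M, batch)
    q /= q.sum()
    for _ in range(n_iters):
        # Equalize the total mass assigned to each prototype.
        q /= q.sum(axis=1, keepdims=True)
        q /= num_prototypes
        # Equalize the total mass assigned to each sample.
        q /= q.sum(axis=0, keepdims=True)
        q /= batch_size
    return (q * batch_size).T  # shape (batch, M)

# Usage with random unit-norm embeddings and prototypes (illustrative data only).
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)
c = rng.normal(size=(4, 16))
c /= np.linalg.norm(c, axis=1, keepdims=True)
targets = sinkhorn_knopp(z @ c.T)
```

The alternating row and column normalization is what keeps all samples from collapsing onto a single prototype; the final step leaves each sample's assignment as a valid probability distribution over the $M$ prototypes.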

Abstract

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. Unsupervised cross-domain Reinforcement Learning (RL) pre-training shows great potential for challenging continuous visual control but remains difficult. In this paper, we propose Cross-domain Random Pre-Training with prototypes (CRPTpro), a novel, efficient, and effective self-supervised cross-domain RL pre-training framework. CRPTpro decouples data sampling from encoder pre-training, using decoupled random collection to generate a qualified cross-domain pre-training dataset easily and quickly. Moreover, a novel prototypical self-supervised algorithm is proposed to pre-train an effective visual encoder that is generic across different domains. Without finetuning, the cross-domain encoder can be applied to challenging downstream tasks defined in different domains, either seen or unseen. Compared with recent advanced methods, CRPTpro achieves better downstream policy learning without training exploration agents for data collection, greatly reducing the burden of pre-training. We conduct extensive experiments across eight challenging continuous visual-control domains, including balance control, robot locomotion, and manipulation. CRPTpro significantly outperforms the next-best method, Proto-RL(C), on 11/12 cross-domain downstream tasks with only 54.5% of its wall-clock pre-training time, exhibiting state-of-the-art pre-training performance with greatly improved pre-training efficiency.
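The decoupling described above can be sketched as follows: observations are gathered with uniformly random actions into one buffer per domain, with no exploration agent involved, and pre-training batches are then drawn from the buffers in a sequential, cyclic order. This is a hedged illustration only. `ToyDomain`, `collect_random`, `cyclic_batches`, the domain names, and all sizes are hypothetical stand-ins invented for this example, not the paper's code or the DMControl API.

```python
import itertools
import numpy as np

class ToyDomain:
    """Hypothetical stand-in for an image-based control domain."""

    def __init__(self, action_dim, obs_shape=(3, 8, 8), seed=0):
        self.action_dim = action_dim
        self.obs_shape = obs_shape
        self._rng = np.random.default_rng(seed)

    def reset(self):
        return self._rng.random(self.obs_shape, dtype=np.float32)

    def step(self, action):
        # A real environment would use the action; the toy just emits a new frame.
        return self._rng.random(self.obs_shape, dtype=np.float32)

def collect_random(env, num_steps, episode_len, rng):
    """Fill one buffer with observations gathered under uniformly random actions."""
    buffer = []
    obs = env.reset()
    for t in range(num_steps):
        buffer.append(obs)
        if (t + 1) % episode_len == 0:
            obs = env.reset()
        else:
            action = rng.uniform(-1.0, 1.0, size=env.action_dim)
            obs = env.step(action)
    return np.stack(buffer)

def cyclic_batches(buffers, batch_size, num_updates, rng):
    """Yield (domain_name, batch), visiting the buffers in round-robin order."""
    names = itertools.cycle(buffers)
    for _ in range(num_updates):
        name = next(names)
        data = buffers[name]
        idx = rng.integers(0, len(data), size=batch_size)
        yield name, data[idx]

# Collection happens once, up front; pre-training only reads the static buffers.
rng = np.random.default_rng(0)
domains = {"domain_a": ToyDomain(1, seed=1), "domain_b": ToyDomain(6, seed=2)}
buffers = {name: collect_random(env, 20, 10, rng) for name, env in domains.items()}
visited = [name for name, _ in cyclic_batches(buffers, 4, 5, rng)]
```

Because the buffers are static, collection cost is paid once and is independent of the encoder, which is the efficiency argument the abstract makes.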

Paper Structure (21 sections, 11 equations, 9 figures, 6 tables, 1 algorithm)


Figures (9)

  • Figure 1: Difference between three kinds of visual encoders in image-based RL on DMControl. (a) The single-task encoder. It is dedicated to only one task. (b) The single-domain encoder. It generalizes across tasks in a single domain, i.e., tasks from the same domain can share the same encoder. (c) The cross-domain encoder. It generalizes across both tasks and domains, i.e., tasks from different domains can share the same encoder. CRPTpro pre-trains a powerful cross-domain encoder enabling state-of-the-art downstream policy learning across multiple domains in challenging continuous visual control.
  • Figure 2: Schematic diagram of CRPTpro. During pre-training, CRPTpro decouples data sampling from encoder pre-training, employing decoupled random collection to easily and quickly produce a qualified cross-domain pre-training dataset for SSL. Next, CRPTpro applies a novel self-supervised algorithm (efficient prototypical learning) to the different data buffers sequentially and cyclically, pre-training an effective cross-domain encoder and a set of prototypes. After pre-training, the encoder and prototypes are frozen and used for efficient downstream RL on sets of challenging continuous visual-control tasks from different domains, either seen or unseen. Finetuning in a single domain is optional: it improves single-domain downstream policy learning but reduces the cross-domain versatility of the encoder to some extent.
  • Figure 3: The proposed efficient prototypical learning in CRPTpro. It learns a visual encoder $f_\theta$ and some basic vectors called prototypes $\{c_j\}_{j=1}^M$ for downstream RL. It contains a comparative loss $\mathcal{L}_{comp}$ to compare observations projected onto prototypes (serving as cluster centers) with their clustering assignment targets, and an intrinsic loss $\mathcal{L}_{intr}$ to facilitate the coverage and diffusion of prototypes.
  • Figure 4: Performance comparison between CRPTpro-finetuning and non-cross-domain baselines. The encoder is pre-trained by CRPTpro on Group-A and then finetuned on 3 unseen domains and 2 seen domains respectively. Finetuning helps on both seen and unseen domains. CRPTpro-finetuning can be regarded as a single-domain pre-training method; its downstream performance is competitive with Proto-RL(S), a state-of-the-art single-domain pre-training method that is also among the state-of-the-art image-based RL methods.
  • Figure 5: Top: Ablating $\mathcal{L}_{intr}$ from CRPTpro and adding $\mathcal{L}_{intr}$ to Proto-RL(C). This panel also serves as the ablation study of both the decoupled random collection and the efficient prototypical learning in CRPTpro. Bottom: Adding $\mathcal{L}_{intr}$ to Proto-RL(S). Efficient prototypical learning is effective in all 3 pre-training settings.
  • ...and 4 more figures
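As a rough illustration of the loss structure named in the Figure 3 caption, the sketch below combines a comparative term with an intrinsic term as $\mathcal{L}_{SSL}=\mathcal{L}_{comp}+\alpha\mathcal{L}_{intr}$ and shows a generic EMA update for target parameters. Several things here are assumptions: the comparative term is written as a cross-entropy between softmaxed prototype scores and assignment targets, the `temperature` and `tau` values are arbitrary, and $\mathcal{L}_{intr}$ is passed in as a plain number because its form is not given in this section.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def comparative_loss(z, prototypes, targets, temperature=0.1):
    """Assumed form: cross-entropy between prototype-score predictions and targets."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    c = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    log_p = log_softmax(z @ c.T / temperature, axis=1)
    return float(-(targets * log_p).sum(axis=1).mean())

def ssl_loss(l_comp, l_intr, alpha):
    """Weighted sum L_SSL = L_comp + alpha * L_intr (L_intr supplied by the caller)."""
    return l_comp + alpha * l_intr

def ema_update(target_params, online_params, tau=0.99):
    """Generic EMA: target <- tau * target + (1 - tau) * online, per parameter array."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]

# Toy check: embeddings equal to the prototypes, with matching vs. shifted targets.
protos = np.eye(3)
loss_match = comparative_loss(protos, protos, np.eye(3))
loss_mismatch = comparative_loss(protos, protos, np.roll(np.eye(3), 1, axis=1))
new_target = ema_update([np.zeros(2)], [np.ones(2)], tau=0.9)
```

In this toy setup the loss is small when each observation's target agrees with its nearest prototype and large when it does not, which is the behavior a comparative term of this kind is meant to have.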