Cross-domain Random Pre-training with Prototypes for Reinforcement Learning
Xin Liu, Yaran Chen, Haoran Li, Boyu Li, Dongbin Zhao
TL;DR
CRPTpro addresses unsupervised cross-domain RL pre-training in two ways. First, decoupled random collection separates data collection from encoder pre-training, so no exploration agent has to be trained and the encoder learns from a static, diverse cross-domain dataset. Second, an efficient prototypical learning method trains a cross-domain encoder $f_\theta$ and prototypes $\{c_j\}_{j=1}^M$ with the self-supervised objective $\mathcal{L}_{SSL}=\mathcal{L}_{comp}+\alpha\mathcal{L}_{intr}$, using Sinkhorn-Knopp and EMA-based targets. The approach achieves state-of-the-art cross-domain downstream RL performance across eight DMControl domains, with a mean expert-normalized score of $1.956$ and only $54.5\%$ of the wall-clock pre-training time of the next-best baseline, Proto-RL(C), while generalizing to unseen domains. These results indicate that cross-domain random data can be highly informative for representation learning and can support a generalist encoder for multi-domain RL tasks.
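To make the two kinds of targets named above concrete, the following is a minimal NumPy sketch of Sinkhorn-Knopp balanced assignment and an EMA parameter update. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature `epsilon`, the iteration count `n_iters`, and the momentum `tau` are all hypothetical choices, and the loss terms $\mathcal{L}_{comp}$ and $\mathcal{L}_{intr}$ are not reproduced here.

```python
import numpy as np


def sinkhorn_knopp(scores, n_iters=3, epsilon=0.05):
    """Turn a (B, M) embedding-prototype score matrix into soft targets.

    Rows of the result sum to 1; alternating normalization pushes the
    M prototypes toward receiving equal total mass across the batch.
    epsilon and n_iters are illustrative values (assumptions).
    """
    scores = np.asarray(scores, dtype=np.float64)
    # Subtract the global max before exponentiating for numerical stability.
    q = np.exp((scores - scores.max()) / epsilon).T  # shape (M, B)
    q /= q.sum()
    n_protos, batch = q.shape
    for _ in range(n_iters):
        # Normalize rows: each prototype receives total mass 1 / M.
        q /= q.sum(axis=1, keepdims=True)
        q /= n_protos
        # Normalize columns: each sample carries total mass 1 / B.
        q /= q.sum(axis=0, keepdims=True)
        q /= batch
    # Rescale so each sample's assignment is a probability vector.
    return (q * batch).T  # shape (B, M)


def ema_update(target, online, tau=0.99):
    """Exponential moving average of parameters (tau is an assumed value)."""
    return tau * target + (1.0 - tau) * online


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical unit-norm embeddings f_theta(x) and prototypes c_j.
    z = rng.normal(size=(8, 16))
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    c = rng.normal(size=(4, 16))
    c /= np.linalg.norm(c, axis=1, keepdims=True)
    targets = sinkhorn_knopp(z @ c.T)
    print(targets.shape, targets.sum(axis=1))
```

The sketch assumes cosine-similarity scores in $[-1, 1]$; with much larger score ranges, a small `epsilon` could underflow and would need a log-domain variant.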
Abstract
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. Unsupervised cross-domain Reinforcement Learning (RL) pre-training shows great potential for challenging continuous visual control, but it remains difficult to achieve. In this paper, we propose \textbf{C}ross-domain \textbf{R}andom \textbf{P}re-\textbf{T}raining with \textbf{pro}totypes (CRPTpro), a novel, efficient, and effective self-supervised cross-domain RL pre-training framework. CRPTpro decouples data sampling from encoder pre-training, introducing decoupled random collection to generate a suitable cross-domain pre-training dataset quickly and easily. Moreover, we propose a novel prototypical self-supervised algorithm to pre-train an effective visual encoder that is generic across domains. Without finetuning, the cross-domain encoder can be applied to challenging downstream tasks defined in different domains, whether seen or unseen during pre-training. Compared with recent advanced methods, CRPTpro achieves better downstream policy-learning performance without training extra exploration agents for data collection, which greatly reduces the pre-training burden. We conduct extensive experiments across eight challenging continuous visual-control domains, covering balance control, robot locomotion, and manipulation. CRPTpro significantly outperforms the next-best method, Proto-RL(C), on 11/12 cross-domain downstream tasks with only 54.5\% of the wall-clock pre-training time, achieving state-of-the-art pre-training performance with greatly improved pre-training efficiency.
