Decoupling Exploration and Exploitation for Unsupervised Pre-training with Successor Features
JaeYoon Kim, Junyu Xuan, Christy Liang, Farookh Hussain
TL;DR
This work tackles unsupervised pre-training for reinforcement learning by leveraging successor features (SFs) to decouple environment dynamics from rewards, addressing issues of local optima in monolithic exploration. It introduces NMPS, a non-monolithic exploration framework that splits exploitation and exploration into separate agents controlled by a mode-switching mechanism based on the value promise discrepancy $D_{promise}$, and trains these agents via off-policy DDPG with flexible data routing through replay buffers $D_R^{\#}$. The Exploit agent uses SFs to preserve task inference during fine-tuning, while a competence-based Explor agent (e.g., DIAYN or APS-based) enhances exploration and expands discriminator generalization; after pre-training, only the Exploit component is used for downstream tasks. Empirical results on Walker, Jaco Arm, and Quadruped in the DeepMind Control Suite show that NMPS variants generally outperform APS and other baselines, with Quadruped achieving the strongest gains under certain configurations (e.g., NMPS_D_sep^{# # D_A10}), illustrating the benefits of decoupled exploration and flexible discriminators for unsupervised pre-training and subsequent task adaptation.
Abstract
Unsupervised pre-training has sought to exploit a value function representation known as successor features (SFs), which decouples the dynamics of the environment from the rewards. This decomposition has a significant impact on task-specific fine-tuning. However, existing approaches struggle with local optima because exploration and exploitation share a unified intrinsic reward, which ignores the linear regression problem and the fact that the discriminator supports only a small skill space. We propose a novel unsupervised pre-training model with SFs based on a non-monolithic exploration methodology. Our approach decomposes the exploitation and exploration of an SF-based agent, which requires a separate agent for each purpose. The idea leverages not only the inherent characteristics of SFs, such as quick adaptation to new tasks, but also exploratory and task-agnostic capabilities. We term the model Non-Monolithic unsupervised Pre-training with Successor features (NMPS); it improves on the performance of the original monolithic exploration method of pre-training with SFs. In a comparative experiment, NMPS outperforms Active Pre-training with Successor Features (APS).
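To make the SF decomposition concrete, the following is a minimal tabular sketch and not the NMPS implementation, which trains neural agents with DDPG. It assumes a fixed policy with state-transition matrix `P`, a feature matrix `Phi`, and rewards that are exactly linear in the features, $r = \Phi w$; all names and sizes here are hypothetical and chosen only for illustration. Under these assumptions the SFs are $\Psi = (I - \gamma P)^{-1}\Phi$, the value function for any task vector $w$ is $\Psi w$, and task inference reduces to a linear regression of observed rewards on features.

```python
import numpy as np

def successor_features(P, Phi, gamma):
    """Closed-form SFs for a fixed policy: Psi = (I - gamma * P)^{-1} Phi."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, Phi)

def infer_task_vector(Phi_samples, rewards):
    """Task inference as least squares: find w such that Phi_samples @ w ~ rewards."""
    w, *_ = np.linalg.lstsq(Phi_samples, rewards, rcond=None)
    return w

# Hypothetical toy problem (sizes and values are assumptions for illustration).
rng = np.random.default_rng(0)
n_states, n_features, gamma = 6, 3, 0.9

# Random row-stochastic transition matrix induced by some fixed policy.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)

# State features; the SFs depend only on the dynamics and these features.
Phi = rng.normal(size=(n_states, n_features))
Psi = successor_features(P, Phi, gamma)

# A downstream task is defined purely by its reward weights w.
w_true = np.array([1.0, -0.5, 2.0])
r = Phi @ w_true

# Recover w from (feature, reward) pairs, then reuse Psi without relearning it.
w_hat = infer_task_vector(Phi, r)
V_sf = Psi @ w_hat

# Reference: solve the Bellman equation directly for this reward.
V_direct = np.linalg.solve(np.eye(n_states) - gamma * P, r)

print("recovered w:", np.round(w_hat, 3))
print("max value error:", np.max(np.abs(V_sf - V_direct)))
```

In this sketch, `Psi` is computed once from the dynamics and is reused for any new `w`, which is the property that makes SF-based fine-tuning quick. When rewards are only approximately linear in the features, the regression step becomes a source of error.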
