Decoupling Exploration and Exploitation for Unsupervised Pre-training with Successor Features

JaeYoon Kim, Junyu Xuan, Christy Liang, Farookh Hussain

TL;DR

This work tackles unsupervised pre-training for reinforcement learning by leveraging successor features (SFs) to decouple environment dynamics from rewards, addressing issues of local optima in monolithic exploration. It introduces NMPS, a non-monolithic exploration framework that splits exploitation and exploration into separate agents controlled by a mode-switching mechanism based on the value promise discrepancy $D_{promise}$, and trains these agents via off-policy DDPG with flexible data routing through replay buffers $D_R^{\#}$. The Exploit agent uses SFs to preserve task inference during fine-tuning, while a competence-based Explor agent (e.g., DIAYN or APS-based) enhances exploration and expands discriminator generalization; after pre-training, only the Exploit component is used for downstream tasks. Empirical results on Walker, Jaco Arm, and Quadruped in the DeepMind Control Suite show that NMPS variants generally outperform APS and other baselines, with Quadruped achieving the strongest gains under certain configurations (e.g., NMPS_D_sep^{# # D_A10}), illustrating the benefits of decoupled exploration and flexible discriminators for unsupervised pre-training and subsequent task adaptation.
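The mode-switching idea can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the common k-step form of the value promise discrepancy from the mode-switching literature, $D_{promise}(t-k, t) = |V(s_{t-k}) - \sum_{i=0}^{k-1} \gamma^i r_{t-k+i} - \gamma^k V(s_t)|$, and a fixed threshold. The function names, the threshold, and the rule "large discrepancy means explore" are hypothetical choices made for illustration.

```python
# Illustrative sketch only; names, threshold, and switching rule are assumptions.

def value_promise_discrepancy(v_start, rewards, v_end, gamma):
    """Gap between the value promised k steps ago and what was realized.

    v_start: value estimate V(s_{t-k}) at the start of the window
    rewards: the k rewards observed since then
    v_end:   value estimate V(s_t) at the end of the window
    """
    k = len(rewards)
    realized = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return abs(v_start - realized - (gamma ** k) * v_end)


def choose_mode(d_promise, threshold):
    """Hypothetical rule: explore when the value estimate broke its promise."""
    return "explore" if d_promise > threshold else "exploit"


# Promise kept: the observed rewards plus the bootstrapped value match V(s_{t-k}).
d_kept = value_promise_discrepancy(1.0, [0.5, 0.5], 1.0, gamma=0.5)

# Promise broken: a value of 1.0 was predicted but nothing was collected.
d_broken = value_promise_discrepancy(1.0, [0.0, 0.0], 0.0, gamma=0.9)

mode_kept = choose_mode(d_kept, threshold=0.5)
mode_broken = choose_mode(d_broken, threshold=0.5)
```

With these hand-picked numbers the first window yields no discrepancy and stays in exploit mode, while the second yields a discrepancy of 1.0 and triggers explore mode.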

Abstract

Unsupervised pre-training has sought to exploit successor features (SFs), a value function representation that decouples the dynamics of the environment from the rewards. This decomposition has a significant impact on task-specific fine-tuning. However, existing approaches struggle with local optima because they use a single intrinsic reward for both exploration and exploitation, without accounting for the linear regression problem or for a discriminator that supports only a small skill space. We propose a novel unsupervised pre-training model with SFs based on a non-monolithic exploration methodology. Our approach decomposes the exploitation and exploration of an agent built on SFs, which requires a separate agent for each purpose. This design leverages not only the inherent characteristics of SFs, such as quick adaptation to new tasks, but also exploratory and task-agnostic capabilities. Our model is termed Non-Monolithic unsupervised Pre-training with Successor features (NMPS), and it improves on the performance of the original monolithic exploration method of pre-training with SFs. NMPS outperforms Active Pre-training with Successor Features (APS) in a comparative experiment.
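To make the decoupling concrete, here is a minimal sketch of the standard successor-feature decomposition, assuming rewards are linear in some features, $r = \phi^\top w$, so that $Q(s, a) = \psi(s, a)^\top w$. The numbers, the function names, and the plain gradient-descent solver are hypothetical and chosen for illustration; the paper's agents use neural networks, not this toy setup.

```python
# Illustrative sketch only; features, rewards, and solver are made-up assumptions.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def q_value(psi, w):
    """Q(s, a) = psi(s, a)^T w: dynamics live in psi, the task lives in w."""
    return dot(psi, w)


def infer_task_vector(features, rewards, lr=0.1, steps=2000):
    """Fit w so that phi_i^T w ~= r_i (least squares via batch gradient descent)."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for phi, r in zip(features, rewards):
            err = dot(phi, w) - r
            for j in range(dim):
                grad[j] += 2.0 * err * phi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Hypothetical features and rewards produced by a hidden task vector.
true_w = [1.0, -0.5]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 2.0]]
rewards = [dot(phi, true_w) for phi in features]

# Task inference: recover w from (feature, reward) pairs by linear regression.
w_hat = infer_task_vector(features, rewards)

# The same successor features are reused across tasks; only w changes.
psi_sa = [3.0, 4.0]
q_task_a = q_value(psi_sa, w_hat)
q_task_b = q_value(psi_sa, [0.0, 1.0])
```

The point of the sketch is that a new task only requires a new `w`: the successor features `psi_sa` are evaluated under two different task vectors without being re-learned.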

Paper Structure (18 sections, 10 equations, 4 figures, 2 tables, 1 algorithm)

This paper contains 18 sections, 10 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: An example of noise-based monolithic exploration (left) and non-monolithic exploration (right). In noise-based monolithic exploration, the noise drives exploration and the agent's action drives exploitation. The final action, 'Action', taken by the agent in the environment is the agent's action with noise added. In non-monolithic exploration, the exploitation agent and the exploration agent each serve their own purpose, coordinated by a mode-switching controller. The mode-switching controller considers the state of one or both agents.
  • Figure 2: The architecture of our proposed pre-training model, NMPS (right), compared with Active Pre-training with Successor Features (left).
  • Figure 3: Fine-tuning comparison of NMPS (the best variant), APS, DIAYN and SMM on Walker (a), Jaco Arm (b) and Quadruped (c), using the intrinsic reward of APS or DIAYN for Explor of NMPS.
  • Figure 4: Fine-tuning comparison of all $NMPS\_X$ variants and APS on Walker (a) and Jaco Arm (b, smoothed line), and of most NMPS variants and APS on Quadruped (c), using the intrinsic reward of APS or DIAYN for Explor of NMPS.
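
The contrast described in the Figure 1 caption can be written as two small action-selection functions. This is a rough sketch under assumptions, not the paper's code: Gaussian noise, the action bounds of [-1, 1], and the function names are hypothetical, and the mode is taken as given rather than computed by a controller.

```python
import random

# Illustrative sketch only; noise model, bounds, and names are assumptions.


def monolithic_action(policy_action, noise_std, rng, low=-1.0, high=1.0):
    """Noise-based monolithic exploration: one agent, action plus noise."""
    noisy = [a + rng.gauss(0.0, noise_std) for a in policy_action]
    return [min(high, max(low, x)) for x in noisy]


def non_monolithic_action(mode, exploit_action, explore_action):
    """Non-monolithic exploration: a controller's mode picks one agent's action."""
    return explore_action if mode == "explore" else exploit_action


rng = random.Random(0)

# Monolithic: the executed action is always a perturbed policy action.
mono = monolithic_action([0.9, -0.9], noise_std=0.5, rng=rng)

# Non-monolithic: the executed action comes entirely from one of two agents.
picked_explore = non_monolithic_action("explore", [0.1, 0.2], [0.7, -0.3])
picked_exploit = non_monolithic_action("exploit", [0.1, 0.2], [0.7, -0.3])
```

In the monolithic case exploration and exploitation are mixed inside every action; in the non-monolithic case each step belongs wholly to one agent, which is what allows the two agents to be trained for separate purposes.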