Diversity Progress for Goal Selection in Discriminability-Motivated RL

Erik M. Lintunen, Nadia M. Ady, Christian Guckelsberger

TL;DR

We demonstrate empirically that an agent motivated by Diversity Progress (DP) can learn a set of distinguishable skills faster than previous approaches, and can do so without suffering a collapse of the goal distribution, a known issue with some prior approaches.

Abstract

Non-uniform goal selection has the potential to improve the reinforcement learning (RL) of skills over uniform-random selection. In this paper, we introduce a method for learning a goal-selection policy in intrinsically-motivated goal-conditioned RL: "Diversity Progress" (DP). The learner forms a curriculum based on observed improvement in discriminability over its set of goals. Our proposed method is applicable to the class of discriminability-motivated agents, where the intrinsic reward is computed as a function of the agent's certainty of following the true goal being pursued. This reward can motivate the agent to learn a set of diverse skills without extrinsic rewards. We demonstrate empirically that a DP-motivated agent can learn a set of distinguishable skills faster than previous approaches, and do so without suffering from a collapse of the goal distribution -- a known issue with some prior approaches. We end with plans to take this proof-of-concept forward.
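
The goal-selection idea described above can be illustrated with a minimal sketch. It assumes that each goal carries a scalar progress value, that goals are sampled from a softmax over those values (the figure captions mention softmax temperatures of 0.1 and 0.3), and that every goal is first tried once in random order (the captions describe initialisation by random selection without replacement). The names `softmax`, `select_goal`, `goal_schedule`, and `update_progress` are hypothetical, and the way the progress value itself is computed is left as a caller-supplied function because it is not specified in this section.

```python
import math
import random


def softmax(values, temperature):
    """Numerically stable softmax; a lower temperature gives a greedier distribution."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def select_goal(progress, temperature, rng):
    """Sample a goal index with probability given by a softmax over progress values."""
    probs = softmax(progress, temperature)
    return rng.choices(range(len(progress)), weights=probs, k=1)[0]


def goal_schedule(num_goals, num_epochs, temperature, update_progress, seed=0):
    """Return the goal chosen at each epoch.

    update_progress(goal, epoch) is a hypothetical callback standing in for
    whatever progress signal the learner measures after training on `goal`.
    """
    rng = random.Random(seed)
    progress = [0.0] * num_goals
    # Assumed initialisation: visit every goal once, in random order.
    init_order = rng.sample(range(num_goals), num_goals)
    schedule = []
    for epoch in range(num_epochs):
        if epoch < num_goals:
            goal = init_order[epoch]
        else:
            goal = select_goal(progress, temperature, rng)
        # Only the goal just trained on has its progress value refreshed.
        progress[goal] = update_progress(goal, epoch)
        schedule.append(goal)
    return schedule
```

With a small temperature the sampler concentrates on the goals with the highest recent progress; with a large one it approaches uniform-random selection.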

Paper Structure

This paper contains 23 sections, 11 equations, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: The effective number of skills over training time in three environments. The agent is learning $20$ skills. We compare several methods, including DP with two different softmax temperatures (0.1 and 0.3), which determine how greedy the goal-selection policy is. The linear decline in the effective number of skills for epochs up to the number of skills is due to DP's initialisation, that is, randomly selecting goals without replacement (see Algorithm 1, ll. 6--10). Results are from five random seeds; each line is a seed.
  • Figure 2: The effects of DP on goal selection over training time in the Half-Cheetah environment. The agent is learning five skills, each shown in a different colour. Upper left: DP values, $\overline{\textit{DP}}$, updated for the current skill at the end of each epoch. Lower left: goal-selection probabilities after the softmax transformation. Right: cumulative frequencies of goal selection. The initial trend for epochs up to the number of skills is due to DP's initialisation, where goals are selected randomly without replacement (see Algorithm 1, ll. 6--10). The plotted traces are from a single random seed.
  • Figure 3: A dimension-reduced feature space (t-SNE) for 20 skills in the Half-Cheetah environment. For each skill (one colour), the $100$ data points represent i.i.d. draws of trajectories. Left: randomly initialised skills with no training. Middle, Right: trajectories sampled after $100$ epochs of training. Note that both methods eventually learn distinguishable skills. Experiment details are given in Appendix \ref{appendix:tsne}.
  • Figure 4: Trajectories drawn from eight stochastic skills in our modified version of the 2D Navigation environment constructed by Eysenbach et al. (2019). Left: random skills with no training. Right: DIAYN-learned skills. Note that these visualisations are provided for intuition and show trajectories only 15 steps long. In Figure 1(a), we included 20 skills and trajectories were 100 steps long.
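
The captions above track an "effective number of skills". One standard way to compute such a quantity from a distribution over goals is the exponential of its Shannon entropy (its perplexity). Whether the paper uses exactly this definition is an assumption here, and the function name `effective_number_of_skills` is hypothetical. A minimal sketch:

```python
import math


def effective_number_of_skills(probs):
    """Exponentiated Shannon entropy (perplexity) of a goal distribution.

    `probs` holds non-negative weights over goals; they are normalised first.
    A uniform distribution over n goals gives n, and a distribution
    concentrated on a single goal gives 1.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probs must contain at least one positive weight")
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0:  # 0 * log(0) is treated as 0
            entropy -= q * math.log(q)
    return math.exp(entropy)
```

Under this definition, a collapse of the goal distribution shows up as the value falling towards 1, while a value close to the number of goals indicates near-uniform selection.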