Exploration by Learning Diverse Skills through Successor State Measures

Paul-Antoine Le Tolguenec, Yann Besse, Florent Teichteil-Konigsbuch, Dennis G. Wilson, Emmanuel Rachelson

TL;DR

This work formalizes the search for diverse skills, building on a previous definition based on the mutual information between states and skills. It considers the distribution of states reached by a policy conditioned on each skill and leverages the successor state measure to maximize the difference between these skill distributions.

Abstract

The ability to perform different skills can encourage agents to explore. In this work, we aim to construct a set of diverse skills which uniformly cover the state space. We propose a formalization of this search for diverse skills, building on a previous definition based on the mutual information between states and skills. We consider the distribution of states reached by a policy conditioned on each skill and leverage the successor state measure to maximize the difference between these skill distributions. We call this approach LEADS: Learning Diverse Skills through Successor States. We demonstrate our approach on a set of maze navigation and robotic control tasks which show that our method is capable of constructing a diverse set of skills which exhaustively cover the state space without relying on reward or exploration bonuses. Our findings demonstrate that this new formalization promotes more robust and efficient exploration by combining mutual information maximization and exploration bonuses.
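
As a rough illustration of the mutual information between states and skills mentioned above, the sketch below computes $I(S;Z)$ for a small discrete example. It is not the LEADS objective or the authors' code: the function name `mutual_information`, the uniform prior over skills, and the toy 8-state distributions are all assumptions made for this example.

```python
import math


def mutual_information(skill_state_probs):
    """Return I(S;Z) in nats, assuming a uniform prior p(z) over skills.

    skill_state_probs[z][s] holds p(s | z), the probability that skill z
    visits discrete state s (hypothetical toy input, not taken from the paper).
    """
    n_skills = len(skill_state_probs)
    n_states = len(skill_state_probs[0])
    p_z = 1.0 / n_skills

    # Marginal state distribution: p(s) = sum_z p(z) * p(s | z).
    p_s = [sum(p_z * row[s] for row in skill_state_probs) for s in range(n_states)]

    # I(S;Z) = sum_z p(z) * sum_s p(s | z) * log(p(s | z) / p(s)).
    mi = 0.0
    for row in skill_state_probs:
        for s, p in enumerate(row):
            if p > 0.0:
                mi += p_z * p * math.log(p / p_s[s])
    return mi


# Four skills, each uniform over its own pair of states (no overlap).
disjoint = [
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
]

# Four skills that all visit every state uniformly (indistinguishable).
overlapping = [[0.125] * 8 for _ in range(4)]

mi_disjoint = mutual_information(disjoint)
mi_overlapping = mutual_information(overlapping)
```

In this toy setting, skills with disjoint state distributions reach the maximum value $\log 4$ for four skills, while skills that share one distribution give zero, which is the sense in which this quantity rewards skills that can be told apart by the states they reach.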

Paper Structure

This paper contains 17 sections, 16 equations, 12 figures, 4 tables, and 1 algorithm.

Figures (12)

  • Figure 1: State distributions of two sets $\mathcal{Z}_1$ (left) and $\mathcal{Z}_2$ (right) of four skills each on a grid maze. Each skill's visited states are represented by a different symbol and distributed uniformly.
  • Figure 2: Skill visualisation for each algorithm. Per algorithm, the tasks are the mazes Easy (top left), U (top right), Hard (bottom left), and the control task Fetch-Reach (bottom right).
  • Figure 3: LEADS exploration of the Hand environment state space, using $n_{\text{skill}}=12$ skills and a PCA over all explored states.
  • Figure 4: (a): The SSM $m(s_0,s,z)$ at the final epoch on Hard maze, per skill, normalized in $[0, 1]$. (b): The uncertainty measure $u(s,z)$ at the final epoch on Hard maze, per skill, with the maximum state highlighted.
  • Figure 5: Relative coverage evolution across six tasks. The x-axis represents the number of samples collected since the algorithm began.
  • ...and 7 more figures