Constrained Ensemble Exploration for Unsupervised Skill Discovery

Chenjia Bai, Rushuai Yang, Qiaosheng Zhang, Kang Xu, Yi Chen, Ting Xiao, Xuelong Li

TL;DR

CeSD approaches unsupervised RL by decoupling skill discovery from mutual information (MI) estimation through a constrained ensemble of skill-specific value functions. It uses self-supervised prototypes to partition the state space into clusters and trains per-cluster Q-networks with entropy-based intrinsic rewards, augmented by a state-distribution constraint that keeps skills non-overlapping and distinguishable. Theoretical results show that per-skill state entropy grows and that partition exploration achieves near-maximal global state coverage, while experiments on 2D mazes and the URLB benchmark demonstrate state-of-the-art downstream adaptation. The approach reduces reliance on MI estimates and yields a diverse set of meaningful skills that enable fast fine-tuning on varied tasks.

Abstract

Unsupervised Reinforcement Learning (RL) provides a promising paradigm for learning useful behaviors via reward-free pre-training. Existing methods for unsupervised RL mainly conduct empowerment-driven skill discovery or entropy-based exploration. However, empowerment often leads to static skills, and pure exploration only maximizes the state coverage rather than learning useful behaviors. In this paper, we propose a novel unsupervised RL framework via an ensemble of skills, where each skill performs partition exploration based on the state prototypes. Thus, each skill can explore the clustered area locally, and the ensemble skills maximize the overall state coverage. We adopt state-distribution constraints for the skill occupancy and the desired cluster for learning distinguishable skills. Theoretical analysis is provided for the state entropy and the resulting skill distributions. Based on extensive experiments on several challenging tasks, we find our method learns well-explored ensemble skills and achieves superior performance in various downstream tasks compared to previous methods.

Paper Structure (41 sections, 7 theorems, 35 equations, 13 figures, 6 tables, 2 algorithms)

Key Result

Theorem 3.1

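Let each cluster have the same number of samples. Then, for $i\in[n]$, the relationship between the maximum entropy of $\pi^*$ in the state set ${\mathbb{S}}$ and that of $\pi_i^*$ in the cluster set ${\mathbb{S}}_i$ is $\mathcal{H}(d^{\pi^*}) = \mathcal{H}(d^{\pi_i^*}) + C(n)$, where $d^{\pi}$ denotes the state distribution induced by $\pi$ and $C(n)=\log n$ depends on the number of clusters $n$.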
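
The constant $C(n)=\log n$ can be motivated by a short counting argument. The sketch below is an illustration, not the paper's proof: it assumes a finite state set with $N$ states, $n$ disjoint clusters of $N/n$ states each, and that the maximum entropy on a finite set is attained by the uniform distribution. The symbols $N$ and $\mathcal{H}_{\max}$ are introduced here only for this illustration.

```latex
\begin{align*}
\mathcal{H}_{\max}({\mathbb{S}})   &= \log N, \\
\mathcal{H}_{\max}({\mathbb{S}}_i) &= \log \frac{N}{n} = \log N - \log n, \\
\mathcal{H}_{\max}({\mathbb{S}}) - \mathcal{H}_{\max}({\mathbb{S}}_i) &= \log n = C(n).
\end{align*}
```

For instance, with $N=64$ states and $n=4$ equal clusters, the global maximum is $\log 64$ and each cluster's maximum is $\log 16$, which differ by $\log 4$.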

Figures (13)

  • Figure 1: The partition exploration process. We adopt the Sinkhorn-Knopp algorithm to learn prototypes and cluster the states. The intrinsic reward is computed by entropy estimation within each cluster and then used to train a cluster-specific $Q$-network.
  • Figure 2: The learning process of CeSD. After initializing skills, we conduct entropy-based exploration for each skill and perform clustering to obtain non-overlapping clusters. Then the state distribution constraint is applied to enhance the diversity of skills. The regularized skills are used for partition exploration in the next round of iteration.
  • Figure 3: Visualization of skill discovery in Maze. Different colors represent the state trajectories with different skill vectors. We let the agent start moving from the black dot in the upper left corner and sample 20 trajectories for each skill for visualization.
  • Figure 4: Performance comparison on 12 downstream tasks of the URLB benchmark. We report aggregate statistics over 10 seeds after fine-tuning, following the evaluation protocol of Agarwal et al. (2021). CeSD achieves new state-of-the-art results on the URLB benchmark.
  • Figure 5: An illustration of the rolling skill learned in Quadruped.
  • ...and 8 more figures
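
The partition exploration step described in the Figure 1 caption can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the prototypes are fixed random vectors rather than learned, the Sinkhorn-Knopp normalization follows the common balanced-assignment form, and the per-cluster entropy reward uses a k-nearest-neighbor particle estimate of the form log(1 + mean k-NN distance). The function names `sinkhorn_knopp` and `knn_entropy_reward` and all hyperparameters are hypothetical.

```python
import numpy as np

def sinkhorn_knopp(scores, n_iters=3, eps=0.05):
    """Balanced soft assignment of B states to K prototypes.

    scores: (B, K) similarities between state embeddings and prototypes.
    Returns a (B, K) matrix whose rows each sum to 1.
    """
    # Subtract the max before exponentiating for numerical stability.
    Q = np.exp((scores - scores.max()) / eps).T  # shape (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)  # each prototype gets equal mass
        Q /= K
        Q /= Q.sum(axis=0, keepdims=True)  # each state gets equal mass
        Q /= B
    return (Q * B).T  # rescale so every state's assignment sums to 1

def knn_entropy_reward(z, labels, k=3):
    """Assumed particle-based reward: log(1 + mean k-NN distance),
    with neighbors searched only inside the state's own cluster."""
    rewards = np.zeros(len(z))
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        if len(idx) < 2:
            continue  # a singleton cluster has no neighbors; reward stays 0
        zc = z[idx]
        dists = np.linalg.norm(zc[:, None, :] - zc[None, :, :], axis=-1)
        kk = min(k, len(idx) - 1)
        # After sorting, column 0 is the zero self-distance, so skip it.
        knn = np.sort(dists, axis=1)[:, 1:kk + 1]
        rewards[idx] = np.log(1.0 + knn.mean(axis=1))
    return rewards

# Toy data: 256 random 4-D "states" and 4 fixed random prototypes (assumed).
rng = np.random.default_rng(0)
states = rng.normal(size=(256, 4))
prototypes = rng.normal(size=(4, 4))

# Cosine similarities keep the scores in [-1, 1].
z = states / np.linalg.norm(states, axis=1, keepdims=True)
c = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)

Q = sinkhorn_knopp(z @ c.T)
labels = Q.argmax(axis=1)                # hard cluster index per state
rewards = knn_entropy_reward(z, labels)  # one intrinsic reward per state
```

In a full pipeline, the rewards of the states in cluster `i` would be used to train only the `i`-th skill's $Q$-network, so each skill explores its own partition.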

Theorems & Definitions (11)

  • Theorem 3.1
  • Lemma 3.2
  • Theorem 3.3
  • Corollary 3.4
  • Proof
  • Theorem: Restatement of Theorem 3.1
  • Proof
  • Lemma: Restatement of Lemma 3.2
  • Proof
  • Theorem: Restatement of Theorem 3.3
  • ...and 1 more