ComSD: Balancing Behavioral Quality and Diversity in Unsupervised Skill Discovery

Xin Liu; Yaran Chen; Dongbin Zhao

TL;DR

ComSD tackles unsupervised skill discovery in settings where the potential skills are rich and hard to distinguish, and where state exploration and skill diversity must be balanced. It introduces a contrastive dynamic reward that combines a particle-based state-entropy term for exploration with a contrastive diversity term for discriminating skills. The two terms are balanced by Skill-based dynaMic Weighting (SMW), which sets the weight as a function of the skill vector. The approach yields state-of-the-art downstream adaptation on multi-joint robots, outperforming all six baselines on 15 of 16 skill-combination tasks and showing competitive results in skill finetuning, and it discovers far-reaching, distinguishable exploration skills in a challenging maze. Pixel-based transfer improves with an auxiliary contrastive target, but a gap to state-based performance remains, motivating future work on automatic weight-range learning and more robust pixel-domain integration.

Abstract

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. Unsupervised skill discovery seeks to acquire a variety of useful skills without extrinsic reward via unsupervised Reinforcement Learning (RL), so that the discovered skills can be efficiently adapted to multiple downstream tasks in various ways. However, recent skill discovery methods struggle to balance state exploration and skill diversity, particularly when the potential skills are rich and hard to discern. In this paper, we propose Contrastive dynamic Skill Discovery (ComSD), which generates diverse and exploratory unsupervised skills through a novel intrinsic incentive named the contrastive dynamic reward (code and videos: https://github.com/liuxin0824/ComSD). The reward contains a particle-based exploration term, which drives agents to reach far-reaching states and thereby acquire exploratory skills, and a novel contrastive diversity term, which promotes discriminability between different skills. Moreover, a novel dynamic weighting mechanism between these two terms is proposed to balance state exploration and skill diversity, which further enhances the quality of the discovered skills. Extensive experiments and analysis demonstrate that ComSD can generate diverse behaviors at different exploratory levels for multi-joint robots, enabling state-of-the-art adaptation performance on challenging downstream tasks. It can also discover distinguishable and far-reaching exploration skills in a challenging tree-like 2D maze.
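The reward structure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The k-nearest-neighbor form of the exploration bonus follows common particle-based entropy estimators and is an assumption here. The function names, the knot values of the piecewise-linear weight, the `alpha` coefficient, and the choice to apply the dynamic weight to the diversity term are all hypothetical. How `flag` is derived from the skill vector is not specified in this summary, so it is treated as a given number.

```python
import math

def knn_exploration_reward(state, particles, k=3):
    """Particle-based exploration bonus (assumed k-NN form): the reward grows
    with the mean distance from `state` to its k nearest stored states."""
    dists = sorted(math.dist(state, p) for p in particles)
    nearest = dists[:k]
    return math.log(1.0 + sum(nearest) / len(nearest))

def smw_beta(flag, knots=((0.0, 0.0), (0.5, 0.2), (1.0, 1.0))):
    """Piecewise-linear weight: linear in `flag` with a different slope in each
    region. The knot values are hypothetical placeholders."""
    flag = min(max(flag, knots[0][0]), knots[-1][0])  # clamp to the covered range
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if flag <= x1:
            return y0 + (y1 - y0) * (flag - x0) / (x1 - x0)
    return knots[-1][1]

def combined_reward(r_exploration, r_diversity, flag, alpha=1.0):
    """Weighted sum of the two terms; which term receives the dynamic weight
    is an assumption of this sketch."""
    return alpha * r_exploration + smw_beta(flag) * r_diversity
```

Under these assumptions, a skill whose `flag` falls in the low region is trained mostly on the exploration term, while a skill in the high region receives a larger share of the diversity term, which is one way to obtain behaviors at different exploratory levels.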

Paper Structure (24 sections, 12 equations, 11 figures, 4 tables, 3 algorithms)

Introduction & Research Background
Related Works
Preliminaries
Problem Definition
Mutual Information Objective of Unsupervised RL
Unsupervised Skill Discovery by ComSD
Skill-conditioned Entropy Estimation via Contrastive Learning
Particle-based State Entropy Estimation
Skill-based DynaMic Weighting (SMW)
Experiments & Analysis
Environments
Baselines
Skill Combination
Skill Finetuning
Adaptation Ablation
...and 9 more sections

Figures (11)

Figure 1: Left: The rich, hard-to-learn, and hard-to-distinguish potential skills significantly increase the difficulty of unsupervised skill discovery. First, the difficulty of learning exploratory skills, i.e., promoting state exploration, is increased. For example, learning to flip and walk for a robot is much harder than reaching two different goals due to the more complex robot kinetics. Second, the difficulty of discerning skills, i.e., promoting skill diversity, is increased. For example, reaching different goals can be easily represented by different ending states, while flipping and walking to the same location cannot be differentiated by only ending states. Third, behaviors at different exploratory levels are helpful for robots (e.g., static standing or dynamic running) but are not considered in goal reaching. Right: ComSD's pipeline. ComSD discovers exploratory and diverse robot behaviors through a novel contrastive dynamic reward, showing state-of-the-art performance on multiple downstream tasks in different kinds of evaluations.
Figure 2: An example illustrating why reducing exploration within each skill (increasing $-H(\tau|z)$), while the overall exploration across all skills ($H(\tau)$) is preserved, increases skill diversity. Originally, the red skill reaches both the red goal and the green goal, while the green skill reaches only the green goal. Here we ignore trajectories and treat different goals as different states. Increasing $-H(\tau|z)$ forces the red skill to give up one of the two goals, lowering its own state exploration. If the red skill reaches the green goal like the green skill (right in the figure), no skill reaches the red goal, and the overall state coverage $H(\tau)$ is reduced. Therefore, the only way to increase $-H(\tau|z)$ without hurting $H(\tau)$ is for the red skill to reach the red goal (left in the figure), which means a larger difference between the two skills.
Figure 3: Left: The design of the contrastive dynamic reward $r^{intr}_{ComSD}$. Right: In SMW, $\beta$ is linearly related to $\mathrm{flag}(z)$, with different slopes in different regions. The skill space is divided into regions with different learning objectives.
Figure 4: Training curves of all 7 methods on the 16 skill combination downstream tasks. ComSD significantly outperforms all 6 baselines on 15 of the 16 tasks, demonstrating that ComSD discovers much more exploratory and diverse behaviors than other methods for challenging multi-joint robots.
Figure 5: Adaptation ablation experiments on skill combination tasks (left two) and skill finetuning tasks (right two). Both the contrastive diversity reward ($r^{diversity}_{contrast}$) and skill-based dynamic weighting (SMW) are necessary for ComSD to achieve its results on both kinds of adaptation evaluation.
...and 6 more figures
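
The argument in the Figure 2 caption can be checked numerically. The sketch below treats the two goals as the only states and computes $H(\tau)$, $H(\tau|z)$, and their difference for the three configurations in the caption. The uniform prior over the two skills and the 50/50 split of the red skill between the two goals in the original configuration are assumptions made for illustration, as are all names in the code.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def coverage_and_diversity(goal_given_skill, skill_prior):
    """Return (H(tau), H(tau|z), H(tau) - H(tau|z)) for a table whose rows are
    skills and whose columns are goal-reaching probabilities."""
    n_goals = len(goal_given_skill[0])
    marginal = [
        sum(pz * row[g] for pz, row in zip(skill_prior, goal_given_skill))
        for g in range(n_goals)
    ]
    h_total = entropy_bits(marginal)
    h_cond = sum(pz * entropy_bits(row)
                 for pz, row in zip(skill_prior, goal_given_skill))
    return h_total, h_cond, h_total - h_cond

PRIOR = [0.5, 0.5]                    # assumed uniform over (red skill, green skill)
# Columns: (red goal, green goal)
ORIGINAL = [[0.5, 0.5], [0.0, 1.0]]   # red skill reaches both goals (assumed 50/50)
RIGHT = [[0.0, 1.0], [0.0, 1.0]]      # red skill collapses onto the green goal
LEFT = [[1.0, 0.0], [0.0, 1.0]]       # red skill keeps only the red goal

for name, table in (("original", ORIGINAL), ("right", RIGHT), ("left", LEFT)):
    h_total, h_cond, diff = coverage_and_diversity(table, PRIOR)
    print(f"{name}: H(tau)={h_total:.3f}  H(tau|z)={h_cond:.3f}  difference={diff:.3f}")
```

Under these assumptions, both the left and right configurations drive $H(\tau|z)$ to zero, but the right one also collapses $H(\tau)$ to zero, while the left one raises $H(\tau)$ to 1 bit. The difference goes from about 0.31 bits in the original configuration to 1 bit on the left, matching the caption's conclusion that the red skill must keep the red goal.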
