Can a MISL Fly? Analysis and Ingredients for Mutual Information Skill Learning

Chongyi Zheng, Jens Tuyls, Joanne Peng, Benjamin Eysenbach

TL;DR

The paper shows that METRA's success can be understood within the MISL framework: METRA's representation objective behaves like a contrastive loss, and its actor objective aligns with an information bottleneck. Building on these insights, it introduces Contrastive Successor Features (CSF), a simpler MISL method that learns representations via a contrastive lower bound and trains a policy with successor features, avoiding METRA's dual gradient descent for representation learning. Experiments on six continuous-control tasks show that CSF performs roughly on par with METRA on exploration and downstream tasks, and ablations highlight the importance of the information-bottleneck-based intrinsic reward and of the chosen representation parameterization. This work reinforces MISL as a viable foundation for unsupervised skill discovery and provides a practical, simpler alternative to Wasserstein-based approaches.

Abstract

Self-supervised learning has the potential of lifting several of the key challenges in reinforcement learning today, such as exploration, representation learning, and reward design. Recent work (METRA) has effectively argued that moving away from mutual information and instead optimizing a certain Wasserstein distance is important for good performance. In this paper, we argue that the benefits seen in that paper can largely be explained within the existing framework of mutual information skill learning (MISL). Our analysis suggests a new MISL method (contrastive successor features) that retains the excellent performance of METRA with fewer moving parts, and highlights connections between skill learning, contrastive representation learning, and successor features. Finally, through careful ablation studies, we provide further insight into some of the key ingredients for both our method and METRA.
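
To make the "contrastive lower bound" idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes an inner-product critic $f(s, s', z) = (\phi(s') - \phi(s))^\top z$ and a standard InfoNCE-style bound computed over a batch; the function names, the batch construction, and the use of NumPy are illustrative assumptions, and the exact loss used by CSF may differ.

```python
import numpy as np


def contrastive_lower_bound(phi_s, phi_next, z):
    """InfoNCE-style lower bound on the mutual information between transitions and skills.

    Assumption (not taken from the summary above): the critic is the inner
    product f(s, s', z) = (phi(s') - phi(s))^T z, and the other skills in the
    batch serve as negatives.

    phi_s, phi_next: arrays of shape (B, d), representations of s and s'.
    z: array of shape (B, d), the skill that generated each transition.
    """
    diff = phi_next - phi_s                    # (B, d)
    logits = diff @ z.T                        # (B, B); entry [i, j] = f(s_i, s'_i, z_j)
    positives = np.diag(logits)                # critic values for matched pairs
    # Numerically stable log-sum-exp over all skills in the batch.
    row_max = logits.max(axis=1, keepdims=True)
    lse = (row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))).squeeze(1)
    # The bound can never exceed log(B).
    return float(np.mean(positives - lse) + np.log(z.shape[0]))


def intrinsic_reward(phi_s, phi_next, z):
    """Hypothetical per-transition reward (phi(s') - phi(s))^T z, shape (B,)."""
    return np.sum((phi_next - phi_s) * z, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B, d = 64, 8
    z = rng.normal(size=(B, d))
    z /= np.linalg.norm(z, axis=1, keepdims=True)   # unit-norm skills
    phi_s = rng.normal(size=(B, d))
    phi_next = phi_s + 5.0 * z                       # transitions aligned with their skill
    print("aligned bound: ", contrastive_lower_bound(phi_s, phi_next, z))
    print("mismatched bound:", contrastive_lower_bound(phi_s, phi_next, np.roll(z, 1, axis=0)))
```

In this toy setup, transitions whose representation difference points along their own skill yield a higher bound than transitions paired with the wrong skill, which is the signal a contrastive representation objective rewards.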

Paper Structure

This paper contains 44 sections, 8 theorems, 39 equations, 13 figures, 3 tables, 2 algorithms.

Key Result

Proposition 1

The optimal state representation $\phi^{\star}$ of the actual METRA representation objective (Eq. eq:repr-loss-expected-transition) satisfies

Figures (13)

  • Figure 1: From METRA to MISL. (Left) METRA argues that optimizing a Wasserstein distance is superior to using mutual information. (Right) Through careful analysis, we show METRA still bears striking similarities to MISL algorithms, which allows us to develop a new MISL algorithm (CSF) that matches the performance of METRA while retaining the theoretical properties associated with MI maximization.
  • Figure 2: Histograms of METRA representations. (a) The expected distance of representations converges to $1.0$, helping to explain what objective METRA's representations are optimizing. (b) Given a latent skill, the conditional difference in representations ($\phi(s') - \phi(s) \mid z$) converges to an isotropic Gaussian distribution. (c) Taking the marginal over latent skills, the normalized difference in representations $\left(\frac{\phi(s') - \phi(s)}{\lVert\phi(s') - \phi(s)\rVert_2}\right)$ converges to a $\textsc{Unif}({\mathbb{S}}^{d - 1})$. These observations are consistent with our theoretical analysis (the corollary connecting METRA and contrastive learning), suggesting that METRA is performing a form of contrastive learning.
  • Figure 3: Ablation studies. (Left) Replacing the METRA representation loss with a contrastive loss retains performance. (Center) Using an information bottleneck to define the intrinsic reward is important for MISL. (Right) Choosing the right parameterization is crucial for good performance. Shaded areas indicate one standard deviation.
  • Figure 4: CSF performs on par with METRA. We compare CSF with baselines on state coverage (left), zero-shot goal reaching (middle), and hierarchical control (right). CSF performs roughly on par with METRA and outperforms all other baselines in most settings. Shaded areas indicate one standard deviation. The appendix figures on state space coverage, goal reaching, and hierarchical control show the learning curves for all tasks.
  • Figure 5: State space coverage. We plot the number of unique coordinates visited by the agent, except for Kitchen, where we plot the task coverage. We find CSF matches the prior state-of-the-art MISL algorithms on $4 / 6$ tasks, and strongly outperforms METRA in Robobin. Shaded areas indicate one standard deviation.
  • ...and 8 more figures

Theorems & Definitions (14)

  • Proposition 1
  • Proposition 2
  • Corollary 1
  • Proposition 3
  • Proposition 3
  • proof
  • Proposition 3
  • proof
  • Proposition 3
  • proof
  • ...and 4 more