Skill-aware Mutual Information Optimisation for Generalisation in Reinforcement Learning

Xuehui Yu; Mhairi Dunion; Xin Li; Stefano V. Albrecht

TL;DR

Empirically, RL agents trained to maximise Skill-aware Mutual Information (SaMI) achieve substantially improved zero-shot generalisation to unseen tasks. The context encoder trained with Skill-aware Noise Contrastive Estimation (SaNCE) is also more robust to a reduction in the number of available samples, and so has the potential to overcome the $\log$-$K$ curse.

Abstract

Meta-Reinforcement Learning (Meta-RL) agents can struggle to operate across tasks with varying environmental features that require different optimal skills (i.e., different modes of behaviour). Using context encoders based on contrastive learning to enhance the generalisability of Meta-RL agents is now widely studied but faces challenges such as the requirement for a large sample size, also referred to as the $\log$-$K$ curse. To improve RL generalisation to different tasks, we first introduce Skill-aware Mutual Information (SaMI), an optimisation objective that aids in distinguishing context embeddings according to skills, thereby equipping RL agents with the ability to identify and execute different skills across tasks. We then propose Skill-aware Noise Contrastive Estimation (SaNCE), a $K$-sample estimator used to optimise the SaMI objective. We provide a framework for equipping an RL agent with SaNCE in practice and conduct experimental validation on modified MuJoCo and Panda-gym benchmarks. We empirically find that RL agents that learn by maximising SaMI achieve substantially improved zero-shot generalisation to unseen tasks. Additionally, the context encoder trained with SaNCE demonstrates greater robustness to a reduction in the number of available samples, thus possessing the potential to overcome the $\log$-$K$ curse.

Paper Structure (33 sections, 3 theorems, 14 equations, 27 figures, 7 tables, 2 algorithms)

Introduction
Related works
Preliminaries
Skill-aware mutual information optimisation for Meta-RL
The $\log$-$K$ curse of $K$-sample MI estimators
Skill-aware mutual information: a smaller ground-truth MI
Skill-aware noise contrastive estimation: a tighter $K$-sample estimator
Skill-aware trajectory sampling strategy
Experiments
Experimental setup
Panda-gym
MuJoCo
Analysis of the $\log$-$K$ curse in sample-limited scenarios
Conclusion and future work
Proof of Lemma 1
...and 18 more sections

Key Result

Lemma 1

When learning a context encoder $\psi$ with a $K$-sample estimator and a finite sample size $K$, we have $I_{\text{InfoNCE}}(x;y \mid \psi,K) \leq \log K \leq I(x;y)$ when $x \not\perp\!\!\!\perp y$ (see the proof in the appendix).
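The upper bound in Lemma 1 can be checked numerically. The sketch below is not the paper's implementation: it assumes the standard InfoNCE form, $\log K + \frac{1}{K}\sum_i \big[f(x_i,y_i) - \log\sum_j e^{f(x_i,y_j)}\big]$, and uses synthetic critic scores. The names `infonce_estimate` and `demo`, and the `scale` parameter, are hypothetical. Even with a near-perfect critic, the estimate saturates at $\log K$.

```python
import numpy as np

def infonce_estimate(scores):
    """InfoNCE estimate from a K x K score matrix (assumed standard form).

    scores[i, j] is a critic value f(x_i, y_j); the diagonal entries are
    the positive pairs and the off-diagonal entries are the negatives.
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[0]
    # Numerically stable log-sum-exp over each row.
    row_max = scores.max(axis=1, keepdims=True)
    lse = row_max[:, 0] + np.log(np.exp(scores - row_max).sum(axis=1))
    # Each term diag - lse is <= 0, so the estimate can never exceed log K.
    return np.log(k) + np.mean(np.diag(scores) - lse)

def demo(ks=(4, 16, 64), scale=50.0, seed=0):
    """Return {K: (estimate, log K)} for a synthetic, near-perfect critic."""
    rng = np.random.default_rng(seed)
    results = {}
    for k in ks:
        # Hypothetical scores: strong positives on the diagonal plus noise.
        scores = rng.normal(size=(k, k)) + scale * np.eye(k)
        results[k] = (infonce_estimate(scores), np.log(k))
    return results

if __name__ == "__main__":
    for k, (est, bound) in demo().items():
        print(f"K={k:3d}  estimate={est:.4f}  log K={bound:.4f}")
```

Under these assumptions, increasing $K$ raises the ceiling only logarithmically, which is why a small sample size caps the measurable MI.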

Figures (27)

Figure 1: (a) In a cube-moving environment, tasks are defined according to different environmental features. (b) Different tasks have different transition dynamics caused by underlying environmental features, hence optimal skills are different across tasks.
Figure 2: A policy $\pi$ conditioned on a fixed context embedding $c$ is defined as a skill $\pi(\cdot|c)$ (shortened as $\pi_c$). The policy $\pi$ conditioned on a fixed $c$ alters the state of the environment in a consistent way, thereby exhibiting a mode of skill. The skill $\pi(\cdot|c_1)$ moves the cube on the table in trajectory $\tau^+_{c_1}$ and is referred to as the Push skill; correspondingly, the Pick&Place skill $\pi(\cdot|c_2)$ takes the cube off the table and places it in the goal position in the trajectory $\tau^+_{c_2}$.
Figure 3: $I_{\text{InfoNCE}}(c;\pi_c;\tau_c)$, with a finite sample size of $K$, is a loose lower bound of $I(c;\tau_c)$ and leads to lower performance embeddings. $I_{\text{SaMI}}(c;\pi_c;\tau_c)$ is a lower ground-truth MI, and $I_{\text{SaNCE}}(c;\pi_c;\tau_c)$ is a tighter lower bound.
Figure 4: A comparison of sample spaces for task $e_1$. Positive samples $\tau_{c_1}$ or $\tau_{c_1}^+$ are always from current task $e_1$. For SaNCE, in a task $e_k$ with embedding $c_k$, the positive skill $\pi_{c_k}^+$ conditions on $c_k$ and generates positive trajectories $\tau_{c_k}^+$, and the negative skill $\pi_{c_k}^-$ generates negative trajectories $\tau_{c_k}^-$. The top graphs show the relationship between $c$, $\pi_c$ and $\tau_c$.
Figure 5: A practical framework for using SaNCE in the meta-training phase. During meta-training, we sample trajectories from the replay buffer for off-policy training. Queries are generated by a context encoder $\psi$, which is updated with gradients from both the SaNCE loss $\mathcal{L}_{\text{SaNCE}}$ and the RL loss $\mathcal{L}_{RL}$. Negative/Positive embeddings are encoded by a momentum context encoder $\psi^*$, which is driven by a momentum update with the encoder $\psi$. During meta-testing, the meta-trained context encoder $\psi$ embeds the current trajectory, and the RL policy takes the embedding as input together with the state for adaptation within an episode.
...and 22 more figures
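The Figure 5 caption mentions a momentum context encoder $\psi^*$ that tracks the encoder $\psi$. The summary does not state the update rule, so the sketch below assumes a common exponential-moving-average form, $\psi^* \leftarrow m\,\psi^* + (1-m)\,\psi$. The function name `momentum_update`, the dictionary-of-arrays parameter layout, and the default coefficient `m=0.99` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def momentum_update(key_params, query_params, m=0.99):
    """Return updated momentum-encoder parameters (assumed EMA rule).

    key_params   : dict of arrays for the momentum encoder (psi*)
    query_params : dict of arrays for the gradient-trained encoder (psi)
    m            : momentum coefficient in [0, 1); hypothetical default
    """
    return {
        name: m * key_params[name] + (1.0 - m) * query_params[name]
        for name in key_params
    }

if __name__ == "__main__":
    psi_star = {"w": np.zeros(3)}
    psi = {"w": np.ones(3)}
    # One step moves psi* a fraction (1 - m) of the way towards psi.
    print(momentum_update(psi_star, psi, m=0.9)["w"])
```

With an update of this kind, the momentum encoder changes slowly, so embeddings of positive and negative samples stay comparatively stable while $\psi$ is trained by gradient descent.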

Theorems & Definitions (5)

Lemma 1
Definition 1: Skills
Definition 2: $K^*$
Lemma 2
Proposition 1
