How Ensembles of Distilled Policies Improve Generalisation in Reinforcement Learning

Max Weltevrede, Moritz A. Zanger, Matthijs T. J. Spaan, Wendelin Böhmer

TL;DR

The paper studies zero-shot generalisation in reinforcement learning by analysing policy distillation after training within the generalisation-through-invariance zero-shot policy transfer (GTI-ZSPT) framework. It proves a bound showing that distilling an ensemble of policies on diverse training data reduces the gap to the optimal policy in unseen contexts; the bound tightens as the ensemble size grows and as the symmetry subgroup covers more of the full symmetry group. Empirically, the authors show that these insights hold beyond the theory's strict assumptions: ensembles distilled on more diverse data can outperform the original agent in tasks such as Reacher with rotational symmetry and Four Rooms. They also extend the insights to behaviour cloning, underscoring the practical value of data diversity and model ensembles for generalisation. Overall, the work provides theoretical guidance and empirical evidence that policy distillation, particularly with ensembles and diverse datasets, improves zero-shot generalisation in RL, and it points to practical strategies for improving robustness to unseen contexts with modest additional distillation effort.

Abstract

In the zero-shot policy transfer setting in reinforcement learning, the goal is to train an agent on a fixed set of training environments so that it can generalise to similar, but unseen, testing environments. Previous work has shown that policy distillation after training can sometimes produce a policy that outperforms the original in the testing environments. However, it is not yet entirely clear why that is, or what data should be used to distil the policy. In this paper, we prove, under certain assumptions, a generalisation bound for policy distillation after training. The theory provides two practical insights: for improved generalisation, you should 1) train an ensemble of distilled policies, and 2) distil it on as much data from the training environments as possible. We empirically verify that these insights hold in more general settings, when the assumptions required for the theory no longer hold. Finally, we demonstrate that an ensemble of policies distilled on a diverse dataset can generalise significantly better than the original agent.
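The two insights from the abstract (distil an ensemble, and distil on as much training data as possible) can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: `teacher_policy` is a hypothetical deterministic teacher, the students are random-feature regressors rather than neural networks, and distillation is done by least-squares regression on the teacher's actions.

```python
import numpy as np

def teacher_policy(states):
    # Hypothetical teacher: maps 2-D states to scalar actions.
    return np.tanh(states @ np.array([1.0, -0.5]))

def fit_student(states, actions, rng, n_features=64):
    # One student: random cosine features followed by a least-squares fit
    # to the teacher's actions. Different random features give different students.
    W = rng.normal(size=(states.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    coef, *_ = np.linalg.lstsq(np.cos(states @ W + b), actions, rcond=None)
    return lambda s: np.cos(s @ W + b) @ coef

def distil_ensemble(states, n_members, seed=0):
    # Distil the same teacher data into several independently initialised students.
    rng = np.random.default_rng(seed)
    actions = teacher_policy(states)
    return [fit_student(states, actions, rng) for _ in range(n_members)]

def ensemble_predict(members, states):
    # The ensemble policy is the average of the members' actions.
    return np.mean([m(states) for m in members], axis=0)

# Distil on states sampled from the (hypothetical) training distribution.
train_states = np.random.default_rng(1).uniform(-1.0, 1.0, size=(32, 2))
members = distil_ensemble(train_states, n_members=5)
```

By Jensen's inequality, the squared error of the averaged prediction is never larger than the average squared error of the individual members, which is one simple reason averaging helps in this toy setting. The paper's bound concerns invariance under a symmetry group, which this sketch does not model.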

Paper Structure

This paper contains 40 sections, 13 theorems, 71 equations, 6 figures, 9 tables.

Key Result

Theorem 1

Let $\pi^*$ be the optimal policy and $\pi_\theta$ the student policy. If the MDP is $(L_T, L_R)$-Lipschitz continuous, the optimal and student policies are $L_\pi$-Lipschitz continuous, and $\gamma L_T (1 + L_{\pi}) < 1$, then it holds that: [inequality missing in the source], where $d^{\pi^*}(s) = (1-\gamma) \sum_{t=0}^\infty \gamma^t \mathbb{P}(s_t=s|\pi^*, p_0)$ is the $\gamma$-discounted visitation distribution.
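For intuition about the $\gamma$-discounted visitation distribution $d^{\pi^*}$ defined in Theorem 1, the sketch below computes it for a small finite Markov chain. The tabular setting, the transition matrix `P` (the chain induced by a fixed policy) and the start distribution `p0` are assumptions chosen for illustration; the theorem itself is stated for Lipschitz-continuous MDPs. In the finite case the infinite sum has the closed form $d^\top = (1-\gamma)\, p_0^\top (I - \gamma P)^{-1}$.

```python
import numpy as np

def discounted_visitation(P, p0, gamma):
    # Closed form: d^T = (1 - gamma) * p0^T (I - gamma P)^{-1},
    # solved as the linear system (I - gamma P)^T d = (1 - gamma) p0.
    n = P.shape[0]
    return (1.0 - gamma) * np.linalg.solve((np.eye(n) - gamma * P).T, p0)

def discounted_visitation_series(P, p0, gamma, horizon=2000):
    # Direct evaluation of (1 - gamma) * sum_t gamma^t P(s_t = s), truncated at `horizon`.
    d = np.zeros(len(p0))
    p_t = np.asarray(p0, dtype=float)
    for t in range(horizon):
        d += (gamma ** t) * p_t
        p_t = p_t @ P  # state distribution at the next time step
    return (1.0 - gamma) * d

# Hypothetical 3-state chain; each row of P sums to one.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.2, 0.0, 0.8]])
p0 = np.array([1.0, 0.0, 0.0])
d = discounted_visitation(P, p0, gamma=0.9)
```

The factor $(1-\gamma)$ normalises the geometric weights, so `d` is a probability distribution over states.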

Figures (6)

  • Figure 1: A 'Reacher with rotational symmetry' CMDP with four training contexts, differing in the location of the shoulder (red), positioned along a circle (dotted line). All contexts share the relative pose of the robot arm (blue). The goal is for the hand (black circle) to reach the goal location (green circle) in the middle. The training contexts can be generated by applying the group of $90^\circ$ rotations to context 1, and the testing contexts can be generated with the full group of rotations ($SO(2)$).
  • Figure 2: The base context set in the illustrative reacher CMDP with varying shoulder location (red) and robot arm pose (blue); see Figure 1 for details.
  • Figure 3: The Training Contexts and Training Contexts + $C_4$ context sets in the 'Reacher with rotational symmetry' CMDP with varying shoulder location (red) and robot arm pose (blue); see Figure 1 for details.
  • Figure 4: Example of Four Rooms training and testing contexts.
  • Figure 5: Test return (left axis) compared with the total variation (trace of the covariance matrix) over orbits of the $SO(2)$ group of rotations (right axis) for (a) different ensemble sizes and (b) subgroups $B \le SO(2)$. The total variation measures how invariant the agent has become with respect to rotations; zero total variation corresponds to perfect invariance. Shown are the mean and 95% confidence intervals over 20 seeds.
  • ...and 1 more figure
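The captions of Figures 1 and 5 describe two operations that are easy to sketch: generating contexts by applying a rotation group to a base context, and measuring invariance as the trace of the covariance matrix of an agent's outputs over an orbit. The code below is a minimal illustration under stated assumptions: contexts are reduced to a single 2-D point (for example a shoulder position), $SO(2)$ would have to be approximated by a finely sampled cyclic group $C_n$, and the two "policies" as well as the population-covariance normalisation are hypothetical choices, not taken from the paper.

```python
import numpy as np

def rotation(angle):
    # 2-D rotation matrix for the given angle in radians.
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

def orbit(x, n_rotations):
    # Orbit of a 2-D point under the cyclic group C_n (rotations by multiples of 2*pi/n).
    angles = 2.0 * np.pi * np.arange(n_rotations) / n_rotations
    return np.stack([rotation(a) @ x for a in angles])

def total_variation(outputs):
    # Trace of the (population) covariance matrix of the outputs across the orbit.
    outputs = np.asarray(outputs, dtype=float).reshape(len(outputs), -1)
    centred = outputs - outputs.mean(axis=0)
    return np.trace(centred.T @ centred) / len(outputs)

# Four contexts generated from one base context by 90-degree rotations (C_4).
c4_contexts = orbit(np.array([1.0, 0.0]), n_rotations=4)

# Hypothetical agents: one depends only on the radius (rotation invariant),
# the other reads off the x-coordinate (not invariant).
invariant_agent = lambda pts: np.linalg.norm(pts, axis=1, keepdims=True)
non_invariant_agent = lambda pts: pts[:, :1]

tv_invariant = total_variation(invariant_agent(c4_contexts))
tv_non_invariant = total_variation(non_invariant_agent(c4_contexts))
```

Under these assumptions the rotation-invariant agent has zero total variation over the orbit, while the agent that depends on the x-coordinate does not.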

Theorems & Definitions (32)

  • Theorem 3
  • proof
  • Lemma 6.2
  • proof
  • Definition 1: Generalisation through invariance ZSPT
  • Definition 2
  • Theorem 1
  • proof
  • Definition 2: Generalisation through invariance ZSPT
  • Definition 2
  • ...and 22 more