SPO: Sequential Monte Carlo Policy Optimisation

Matthew V Macfarlane, Edan Toledo, Donal Byrne, Paul Duckworth, Alexandre Laterre

TL;DR

This paper introduces SPO: Sequential Monte Carlo Policy Optimisation, a model-based reinforcement learning algorithm grounded within the Expectation Maximisation (EM) framework, and shows that SPO provides robust policy improvement and efficient scaling properties.

Abstract

Leveraging planning during learning and decision-making is central to the long-term development of intelligent agents. Recent works have successfully combined tree-based search methods and self-play learning mechanisms to this end. However, these methods typically face scaling challenges due to the sequential nature of their search. While practical engineering solutions can partly overcome this, they often result in a negative impact on performance. In this paper, we introduce SPO: Sequential Monte Carlo Policy Optimisation, a model-based reinforcement learning algorithm grounded within the Expectation Maximisation (EM) framework. We show that SPO provides robust policy improvement and efficient scaling properties. The sample-based search makes it directly applicable to both discrete and continuous action spaces without modifications. We demonstrate statistically significant improvements in performance relative to model-free and model-based baselines across both continuous and discrete environments. Furthermore, the parallel nature of SPO's search enables effective utilisation of hardware accelerators, yielding favourable scaling laws.

Paper Structure

This paper contains 55 sections, 1 theorem, 28 equations, 15 figures, 11 tables, 3 algorithms.

Key Result

Proposition 1

Given a non-parametric variational distribution $q_i$ and a parametric policy $\pi_{\theta_i}$, let $q_{i+1}$ be the analytical solution to the E-step optimisation (eq:optimisation) and $\pi_{\theta_{i+1}}$ the solution to the maximisation problem in the M-step (eq:m-step). Then the ELBO $\mathcal{J}$ is guar…

Figures (15)

  • Figure 1: SPO search: $n$ rollouts, represented by particles $x^{1}, \dots, x^{n}$, each of which represents an SMC trajectory sample, are performed in parallel according to $\pi_i$ (left to right). At each environment step, the weights of the particles are adjusted (indicated in the diagram by circle size). We show two resampling regions where particles are resampled, favouring those with higher weights, and their weights are reset. The target distribution is estimated from the initial actions of the surviving particles (rightmost particles). This target estimate, $q_{i}$, is then used to update $\pi$ in the M-step.
  • Figure 2: Learning curves for discrete and continuous environments. The Y-axis represents the interquartile mean of min-max normalised scores, with shaded regions indicating 95% confidence intervals, across 5 random seeds.
  • Figure 3: (left) Scaling: Mean normalised performance across all continuous environments on $10^8$ environment steps, varying particle numbers $N$ and horizon $h$ for SPO during training. (right) Wall Clock Time Comparison: Performance on Rubik's cube plotted against wall-clock time for AlphaZero and 3 versions of SPO (varying by SMC search depth), with total search budget labeled at each point.
  • Figure 4: Example of a Boxoban Problem
  • Figure 5: Brax
  • ...and 10 more figures
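
The search loop described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation. The toy model (`toy_model`), the uniform prior policy (`uniform_policy`), the reward-over-temperature weight update, and all parameter names are assumptions made for illustration only; the weight update SPO actually uses is not given in this section. The sketch only shows the general shape of the procedure: parallel rollouts, per-step reweighting, periodic resampling with a weight reset, and a target distribution read off the first actions of the surviving particles.

```python
import math
import random
from collections import Counter

ACTIONS = (-1, +1)  # hypothetical toy action set


def toy_model(state, action):
    """Hypothetical model: moving right (+1) earns reward 1, moving left earns 0."""
    return state + action, 1.0 if action == +1 else 0.0


def uniform_policy(state, rng):
    """Stand-in for the prior policy: picks an action uniformly at random."""
    return rng.choice(ACTIONS)


def normalise(log_weights):
    """Turn log-weights into probabilities, shifting by the max for stability."""
    top = max(log_weights)
    unnorm = [math.exp(lw - top) for lw in log_weights]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def smc_search(root_state, n_particles=64, horizon=8, resample_period=4,
               temperature=0.5, seed=0):
    """Estimate a target distribution over the first action from the root."""
    rng = random.Random(seed)
    states = [root_state] * n_particles
    first_actions = [None] * n_particles
    log_w = [0.0] * n_particles

    for t in range(horizon):
        # Advance every particle by one step (conceptually in parallel).
        for k in range(n_particles):
            a = uniform_policy(states[k], rng)
            if t == 0:
                first_actions[k] = a  # remember the initial action
            states[k], r = toy_model(states[k], a)
            log_w[k] += r / temperature  # assumed weight update

        # Periodically resample in proportion to weight, then reset weights.
        if (t + 1) % resample_period == 0 and t + 1 < horizon:
            probs = normalise(log_w)
            idx = rng.choices(range(n_particles), weights=probs, k=n_particles)
            states = [states[i] for i in idx]
            first_actions = [first_actions[i] for i in idx]
            log_w = [0.0] * n_particles

    # Weighted histogram of the first actions carried by surviving particles.
    probs = normalise(log_w)
    target = Counter()
    for a, p in zip(first_actions, probs):
        target[a] += p
    return {a: target[a] for a in ACTIONS}
```

Calling `smc_search(0)` returns a dictionary mapping each toy action to its estimated target probability. In this toy setting the mass should concentrate on `+1`, since that action is the only one rewarded. Because the inner loop over particles has no dependence between particles, it is the part that would map naturally onto a vectorised or accelerator-backed implementation.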

Theorems & Definitions (1)

  • Proposition 1