The RL Perceptron: Generalisation Dynamics of Policy Learning in High Dimensions

Nishil Patel, Sebastian Lee, Stefano Sarao Mannelli, Sebastian Goldt, Andrew Saxe

TL;DR

The paper addresses policy-learning dynamics in high-dimensional reinforcement learning by introducing the RL Perceptron, a solvable model of online policy learning. It derives closed-form ordinary differential equations for two order parameters ($R$ and $Q$) that together track student-teacher alignment. These equations describe the learning trajectories under sparse, delayed, and dense reward schemes, and they allow optimal learning-rate and curriculum schedules to be derived. The analysis reveals phase diagrams with easy and hybrid-hard regimes, a speed-accuracy trade-off controlled by reward stringency, and critical slowing down near transitions; experiments on Bossfight and Pong exhibit similar trends. The framework narrows the gap between theory and practice in high-dimensional RL and offers a foundation for extensions to more realistic policies, value-function learning, and curriculum-driven training strategies.
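
For orientation, the order parameters can be written out under the usual teacher-student conventions. The definitions below are a sketch, not quoted from the paper: they assume a student weight vector $\mathbf{w}$ and a teacher weight vector $\mathbf{w}^*$ in $D$ dimensions, a teacher normalised so that $\mathbf{w}^* \cdot \mathbf{w}^* = D$, sign-valued decisions, and independent isotropic Gaussian inputs.

```latex
% Assumed definitions (standard teacher-student conventions)
R = \frac{\mathbf{w} \cdot \mathbf{w}^*}{D}, \qquad
Q = \frac{\mathbf{w} \cdot \mathbf{w}}{D}, \qquad
\rho = \frac{R}{\sqrt{Q}}

% Probability that all T decisions in an episode agree with the teacher
P_{\text{all correct}} = \left( 1 - \frac{1}{\pi} \arccos \rho \right)^{T}
```

Under these assumptions, a randomly initialised student ($\rho \approx 0$) gets each decision right with probability $1/2$, so an episode of $T=12$ decisions is fully correct with probability $2^{-12} \approx 2.4 \times 10^{-4}$. This gives a rough sense of why learning is delayed when the reward is given only for an all-correct episode.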

Abstract

Reinforcement learning (RL) algorithms have proven transformative in a range of domains. To tackle real-world domains, these systems often use neural networks to learn policies directly from pixels or other high-dimensional sensory input. By contrast, much theory of RL has focused on discrete state spaces or worst-case analysis, and fundamental questions remain about the dynamics of policy learning in high-dimensional settings. Here, we propose a solvable high-dimensional model of RL that can capture a variety of learning protocols, and derive its typical dynamics as a set of closed-form ordinary differential equations (ODEs). We derive optimal schedules for the learning rates and task difficulty - analogous to annealing schemes and curricula during training in RL - and show that the model exhibits rich behaviour, including delayed learning under sparse rewards; a variety of learning regimes depending on reward baselines; and a speed-accuracy trade-off driven by reward stringency. Experiments on variants of the Procgen game "Bossfight" and Arcade Learning Environment game "Pong" also show such a speed-accuracy trade-off in practice. Together, these results take a step towards closing the gap between theory and practice in high-dimensional RL.

Paper Structure (21 sections, 30 equations, 11 figures, 1 table)

Figures (11)

  • Figure 1: The RL-Perceptron is a model for policy learning in high dimensions. (a) In the classic teacher-student model for supervised learning, a neural network called the student is trained on inputs $x$ whose label $y^*$ is given by another neural network, called the teacher. (b) In the RL setting, the student moves through states $s_t$, making a series of $T$ choices in response to inputs $x_t$. The RL-Perceptron extends the teacher-student model by assuming that there is a 'right' choice $y_t$ at each timestep, given by a teacher network. The student receives a reward after $T$ decisions according to a criterion $\Phi$ that depends on the choices made and the corresponding correct choices. (c) Example learning dynamics in the RL-Perceptron for a problem with $T=12$ choices where the reward is given only if all the decisions are correct. The plot shows the expected reward of a student trained in the RL-Perceptron setting in simulations (solid) and from our theoretical results (dashed), obtained by solving the dynamical equations \ref{eq:Rall} and \ref{eq:Qall}. Finite-size simulations and theory show good agreement. We reduce the stochastic evolution of the high-dimensional student to the deterministic evolution of two scalar quantities, $R$ and $Q$ (more details in Sec. \ref{sec:ode}); their evolution is shown in the inset. Parameters: $D=900$, $\eta_1=1$, $\eta_2=0$, $T=12$.
  • Figure 2: ODEs accurately describe diverse learning protocols. Evolution of the normalised student-teacher overlap $\rho$ for the numerical solution of the ODEs (dashed) and simulation (coloured) in three reward protocols. All students receive a reward of $\eta_1$ for getting all decisions in an episode correct, and additionally: (a) A penalty $\eta_2$ (i.e. negative reward) is received if the agent does not survive until the end of an episode. (b) An additional reward of 0.2 is received at timestep $T_0$ if the agent survives beyond $T_0$ timesteps. (c) An additional reward $r_b$ is received at timestep $t$ for every correct decision $y_t$ made in an episode. (d) Episode length $T$ is varied. Parameters: $D=900$, $T=11$, $\eta_1=1$.
  • Figure 3: Optimal schedules for episode length $T$ and learning rate $\eta$. (a) Evolution of the normalised overlap under optimal episode length scheduling (dashed) and various constant episode lengths (green). (b) Evolution of the normalised overlap under optimal learning rate scheduling (dashed) and various constant learning rates (blue). (c) Evolution of optimal $T$ (green) and $\eta$ (blue) over learning. Parameters: $D=900$, $Q=1$, $\eta_2=0$, (a) $\eta_1=1$, (b) $T=8$.
  • Figure 4: Phase plots of learnability in the case where all decisions in an episode of length $T$ must be correct. (a) Fixed points of $\rho$ for $T=13$ and $\eta_1=1$; the dashed portion of the line denotes where the fixed points are unstable. (b) Phase plot showing regions of hardness for $T=13$. (c) Phase plot showing regions of hardness for $T=8$. Green regions represent the Easy phase, where with probability 1 the algorithm converges to the optimal $\rho_{\text{fix}}$ from random initialisation. The orange region indicates the Hybrid-hard phase, where with high probability the algorithm converges to the sub-optimal $\rho_{\text{fix}}$ from random initialisation. Parameters: $D=900$, $Q=1$.
  • Figure 5: Speed-accuracy trade-off. Evolution of the normalised overlap between student and teacher for simulation (solid) and ODE solution (dashed) in the case where an update is made only if $n$ or more of the decisions in an episode of length $T=13$ are correct, with $\eta_2 = 0$. More stringent reward conditions slow learning but can improve performance. Parameters: $D = 900, \eta_1 = 1, \eta_2 = 0$.
  • ...and 6 more figures
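
The setting in the Figure 1 and Figure 2 captions (a student making $T$ decisions per episode, a reward scaled by $\eta_1$ when all decisions match the teacher, and a penalty scaled by $\eta_2$ otherwise) can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code. The Hebbian-style update, the $1/\sqrt{D}$ scaling, the Gaussian inputs, the sign-valued decisions, and the names `overlap` and `simulate` are all assumptions made for illustration; the captions above do not spell out the exact update rule.

```python
import numpy as np


def overlap(w, w_star):
    """Normalised student-teacher overlap rho = R / sqrt(Q) (assumed definition)."""
    D = w.size
    R = w @ w_star / D  # student-teacher overlap
    Q = w @ w / D       # student self-overlap
    return R / np.sqrt(Q)


def simulate(D=900, T=12, eta1=1.0, eta2=0.0, steps=20000, seed=0):
    """Simulate an assumed RL-perceptron-style update and return rho after each episode."""
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(D)
    w_star *= np.sqrt(D) / np.linalg.norm(w_star)  # fix the teacher norm: |w*|^2 = D
    w = rng.standard_normal(D)                     # random student initialisation
    rho = [overlap(w, w_star)]
    for _ in range(steps):
        x = rng.standard_normal((T, D))            # one episode of T inputs
        y = np.sign(x @ w)                         # student decisions
        y_star = np.sign(x @ w_star)               # teacher ("right") decisions
        hebb = (y[:, None] * x).mean(axis=0)       # episode-averaged Hebbian term
        if np.array_equal(y, y_star):              # reward only if all T are correct
            w = w + eta1 / np.sqrt(D) * hebb
        else:                                      # otherwise apply the penalty (no-op if eta2 = 0)
            w = w - eta2 / np.sqrt(D) * hebb
        rho.append(overlap(w, w_star))
    return np.array(rho)
```

Calling `simulate(D=100, T=3, steps=4000)` runs quickly and returns a trajectory of $\rho$ that starts near zero and increases. Larger `T` makes rewarded episodes rarer at initialisation, so the initial plateau should lengthen, in line with the delayed learning under sparse rewards described in the abstract.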