The RL Perceptron: Generalisation Dynamics of Policy Learning in High Dimensions
Nishil Patel, Sebastian Lee, Stefano Sarao Mannelli, Sebastian Goldt, Andrew Saxe
TL;DR
The paper addresses the dynamics of policy learning in high-dimensional reinforcement learning by introducing the RL Perceptron, a solvable model of online policy learning. It derives closed-form ordinary differential equations for two order parameters, $R$ and $Q$, that track student-teacher alignment. These equations describe learning trajectories precisely under sparse, delayed, and dense reward schemes, and make it possible to derive optimal learning-rate and curriculum schedules. The analysis reveals rich phenomena, including phase diagrams with easy and hybrid-hard regimes, a speed-accuracy trade-off controlled by reward stringency, and critical slowing down near transitions; experiments on Bossfight and Pong exhibit a similar speed-accuracy trade-off. The framework takes a step towards closing the gap between theory and practice in high-dimensional RL, and offers a foundation for extensions to more realistic policies, value-function learning, and curriculum-driven training strategies.
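To make the role of the order parameters concrete, here is a minimal sketch of how $R$ and $Q$ could be computed for a given pair of weight vectors. The summary above does not spell out their definitions, so the sketch assumes the usual teacher-student convention: $R$ is the student-teacher overlap and $Q$ the student self-overlap, both divided by the input dimension $D$. The function names `order_parameters` and `normalised_overlap` are hypothetical and chosen only for illustration.

```python
import math


def order_parameters(student, teacher):
    """Return (R, Q) for two weight vectors of equal length D.

    Assumed convention (not stated in the summary):
    R = student . teacher / D,  Q = student . student / D.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher must have the same dimension")
    dim = len(student)
    r = sum(s * t for s, t in zip(student, teacher)) / dim
    q = sum(s * s for s in student) / dim
    return r, q


def normalised_overlap(student, teacher):
    """Cosine of the angle between student and teacher, built from R and Q."""
    r, q = order_parameters(student, teacher)
    teacher_self_overlap = sum(t * t for t in teacher) / len(teacher)
    return r / math.sqrt(q * teacher_self_overlap)


teacher = [1.0, -1.0, 1.0, -1.0]
aligned_student = [2.0, -2.0, 2.0, -2.0]      # same direction, larger norm
orthogonal_student = [1.0, 1.0, 1.0, 1.0]     # no overlap with the teacher

print(order_parameters(aligned_student, teacher))
print(normalised_overlap(aligned_student, teacher))
print(normalised_overlap(orthogonal_student, teacher))
```

Under this convention, a student that points in the teacher's direction has a normalised overlap of 1 whatever its norm, which is why two scalars are enough to summarise alignment in a high-dimensional weight space.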
Abstract
Reinforcement learning (RL) algorithms have proven transformative in a range of domains. To tackle real-world domains, these systems often use neural networks to learn policies directly from pixels or other high-dimensional sensory input. By contrast, much theory of RL has focused on discrete state spaces or worst-case analysis, and fundamental questions remain about the dynamics of policy learning in high-dimensional settings. Here, we propose a solvable high-dimensional model of RL that can capture a variety of learning protocols, and derive its typical dynamics as a set of closed-form ordinary differential equations (ODEs). We derive optimal schedules for the learning rates and task difficulty - analogous to annealing schemes and curricula during training in RL - and show that the model exhibits rich behaviour, including delayed learning under sparse rewards; a variety of learning regimes depending on reward baselines; and a speed-accuracy trade-off driven by reward stringency. Experiments on variants of the Procgen game "Bossfight" and Arcade Learning Environment game "Pong" also show such a speed-accuracy trade-off in practice. Together, these results take a step towards closing the gap between theory and practice in high-dimensional RL.
