Ray Interference: a Source of Plateaus in Deep Reinforcement Learning

Tom Schaul, Diana Borsa, Joseph Modayil, Razvan Pascanu

TL;DR

Deep RL with function approximation can exhibit ray interference, where negative interference between multiple objective components and coupling to future data generation create long learning plateaus. The authors analyze a minimal 2x2 bandit to derive exact continuous-time dynamics, identify saddle points, plateaus, and basins of attraction, and then generalize to factored objectives and RL contexts. They show that plateaus arise near saddles when interference is negative and learning is coupled to performance, and that removing either factor (interference or coupling) eliminates plateaus; plateaus intensify as more components are added. The work highlights a potential explanation for slow convergence in deep RL and suggests remedies such as decoupling representations, using off-policy data, or modular architectures, with implications for multi-task and continual learning.

Abstract

Rather than proposing a new method, this paper investigates an issue present in existing learning algorithms. We study the learning dynamics of reinforcement learning (RL), specifically a characteristic coupling between learning and data generation that arises because RL agents control their future data distribution. In the presence of function approximation, this coupling can lead to a problematic type of 'ray interference', characterized by learning dynamics that sequentially traverse a number of performance plateaus, effectively constraining the agent to learn one thing at a time even when learning in parallel is better. We establish the conditions under which ray interference occurs, show its relation to saddle points and obtain the exact learning dynamics in a restricted setting. We characterize a number of its properties and discuss possible remedies.
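The interplay of interference and coupling described above can be illustrated with a small toy model. The sketch below is a hypothetical construction for illustration only, not the paper's 2x2 bandit or its exact dynamics. It assumes two components $J_k = \sigma(z_k)$ with logits $z = M\theta$, where the made-up mixing matrix `M` (controlled by an assumed parameter `rho`) makes the two component gradients negatively aligned. In the "coupled" variant the update signal scales with $J_k(1-J_k)$, as in a policy gradient, so a poorly performing component also learns slowly. In the "uncoupled" variant the signal is the cross-entropy gradient $1-J_k$, a stand-in for supervised learning.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate(rho, coupled=True, z0=(-1.0, -3.0), lr=0.05, steps=4000):
    """Euler-integrate gradient ascent for a two-component toy objective.

    Assumptions (all hypothetical, chosen for illustration):
      - J_k = sigmoid(z_k) with z = M @ theta, M = [[1, -rho], [-rho, 1]];
        rho > 0 gives negative interference, rho = 0 gives none (needs |rho| < 1).
      - coupled=True ascends J_1 + J_2, whose gradient signal is J_k (1 - J_k).
      - coupled=False ascends sum_k log J_k, whose gradient signal is 1 - J_k.
    Returns an array of shape (steps, 2) holding (J_1, J_2) at each step.
    """
    M = np.array([[1.0, -rho], [-rho, 1.0]])
    # Pick theta so that the initial logits equal z0 (imbalanced start).
    theta = np.linalg.solve(M, np.array(z0, dtype=float))
    traj = np.empty((steps, 2))
    for t in range(steps):
        J = sigmoid(M @ theta)
        traj[t] = J
        signal = J * (1.0 - J) if coupled else (1.0 - J)
        theta = theta + lr * (M.T @ signal)  # chain rule through z = M @ theta
    return traj

ray = simulate(rho=0.5, coupled=True)               # interference + coupling
tabular_like = simulate(rho=0.0, coupled=True)      # coupling only
supervised_like = simulate(rho=0.5, coupled=False)  # interference only

for name, traj in [("ray", ray), ("tabular-like", tabular_like),
                   ("supervised-like", supervised_like)]:
    print(f"{name:16s} final J = {traj[-1].sum():.3f}  min J_2 = {traj[:, 1].min():.4f}")
```

In this toy, only the run with both ingredients pushes the lagging component $J_2$ below its starting value and remains stuck near $J \approx 1$ at the end of the simulated horizon. Removing either ingredient lets both components improve together. This mirrors the qualitative claim of the abstract, under the stated assumptions.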

Paper Structure

This paper contains 30 sections, 29 equations, 9 figures.

Figures (9)

  • Figure 1: Illustration of ray interference in two objective component dimensions $J_1, J_2$. Top row: Arrows indicate the flow direction of the learning trajectories. Each colored line is a (stochastic) sample trajectory, color-coded by performance. Bottom row: Matching learning curves for these same trajectories. Note how the trajectories that pass by the saddle points of the dynamics, at $(0,1)$ and $(1,0)$, in warm colors, hit plateaus and learn much more slowly (note that the scale of the x-axis differs per plot). Each column has a different setup. Left: RL with FA exhibits ray interference as it has both coupling and interference. Middle: Tabular RL has few plateaus because there is no interference in the dynamics. Right: Supervised learning has no plateaus even with FA interference.
  • Figure 2: Bandit learning dynamics: Geometric intuitions to accompany the derivations. The green hyperbolae show the nullclines that enclose the WTA regions. Inflection points are shown in blue, of which the solid lines are plateaus ($\dddot{J}>0$), while the dashed lines are not. The orange path encloses the basin of attraction for a plateau of $\epsilon=0.1$. The red polygon is its lower-bound approximation for which the vertices can be derived explicitly (\ref{sec:lower-bound}).
  • Figure 3: Likelihood of encountering a flat plateau. This plot shows the likelihood (vertical axis) that the slowest learning progress, $\min |\dot{J}|$, along a trajectory is below some value; when there is a plateau, this value is its flatness (horizontal axis). For example, 20% of on-policy runs (red curve) traverse a very flat plateau with $\epsilon \leq 10^{-5}$. All these results are empirical quantiles, when starting at low initial performance, $J(\theta_0)=\frac{K}{10}$, and ignoring slow progress near the start or the optimum. There are four settings: ray interference (red) is a consequence of two ingredients, interference and coupling. Multiple ablations eliminate it: interference can be removed by training separate networks or using a tabular representation (green); coupling can be removed by off-policy RL with uniform exploration (blue) or a supervised learning setup as in \ref{sec:supervised} (yellow). One key contributing factor that impacts whether a trajectory is 'lucky' is whether it is initialized near the diagonal ($J_1(\theta_0) \approx J_2(\theta_0)$) or not: the more imbalanced the initial performance, the more likely it is to encounter a slow plateau.
  • Figure 4: Learning curves when scaling up the problem dimension (jointly $K$ and $n$). We observe that the $K=8$ runs go through more separate plateaus, and each plateau takes exponentially longer to overcome than the previous one (the horizontal axis is log-scale).
  • Figure 5: Basins of attraction, for plateaus of different $\epsilon$, and for different levels of initial performance $J(\theta_0)$, under deterministic dynamics. The dashed line indicates the typical $\epsilon$ for which learning is 10 times slower than necessary (see \ref{fig:eps-badness}), so for example half of the trajectories initialized at $J(\theta_0)=0.2$ hit such a flat plateau.
  • ...and 4 more figures
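
The plateau statistic in the Figure 3 caption, the slowest learning progress $\min |\dot{J}|$ along a trajectory, can be computed from a sampled learning curve roughly as follows. This is a minimal sketch, not the authors' evaluation code. The function names, the finite-difference estimate of $\dot{J}$, and the `lo`/`hi` thresholds used to ignore slow progress near the start and the optimum are all assumptions made for illustration.

```python
import numpy as np

def plateau_flatness(J, dt=1.0, lo=None, hi=None):
    """Smallest |dJ/dt| along a sampled learning curve J(t).

    Points with J <= lo or J >= hi are ignored, as a stand-in for
    "ignoring slow progress near the start or the optimum".
    The thresholds are hypothetical and not taken from the paper.
    """
    J = np.asarray(J, dtype=float)
    speed = np.abs(np.gradient(J, dt))  # finite-difference estimate of |dJ/dt|
    mask = np.ones(J.shape, dtype=bool)
    if lo is not None:
        mask &= J > lo
    if hi is not None:
        mask &= J < hi
    if not mask.any():
        raise ValueError("no samples inside the (lo, hi) performance window")
    return float(speed[mask].min())

def fraction_below(flatness_values, eps):
    """Empirical fraction of runs whose flatness is at most eps."""
    return float(np.mean(np.asarray(flatness_values) <= eps))

# Two synthetic curves rising from 0 to 2: one smooth, one with a plateau near J = 1.
t = np.linspace(0.0, 100.0, 10001)
dt = t[1] - t[0]
smooth = 2.0 / (1.0 + np.exp(-(t - 50.0) / 5.0))
stepped = 1.0 / (1.0 + np.exp(-(t - 20.0) / 2.0)) + 1.0 / (1.0 + np.exp(-(t - 80.0) / 2.0))

flat_smooth = plateau_flatness(smooth, dt, lo=0.2, hi=1.8)
flat_stepped = plateau_flatness(stepped, dt, lo=0.2, hi=1.8)
print(flat_smooth, flat_stepped, fraction_below([flat_smooth, flat_stepped], 1e-5))
```

Applied to many runs, `fraction_below` gives one point of an empirical curve of the kind the caption describes. The synthetic curves here only exercise the computation and carry no information about the paper's results.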

Theorems & Definitions (3)

  • Definition 1: Interference
  • Definition 2: Plateaus
  • Definition 3: Winner-take-all