On Double Descent in Reinforcement Learning with LSTD and Random Features

David Brellmann; Eloïse Berthier; David Filliat; Goran Frehse

TL;DR

This paper investigates the behavior of temporal-difference learning in reinforcement learning under severe overparameterization by introducing a double asymptotic regime where the number of parameters $N$ and distinct visited states $m$ grow to infinity at a fixed ratio. The authors study regularized least-squares TD with random features in the lazy training regime, deriving deterministic equivalents for both the empirical and true Mean-Squared Bellman Error (MSBE) that include resolvent-based correction terms responsible for a double descent around the interpolation threshold $N/m=1$. They show these correction terms vanish as the regularization $\lambda$ increases or as all states become visited, and they validate the theory with experiments on synthetic MRPs and Taxi-v3 that closely match the predicted behavior. The work connects random-feature and resolvent techniques from supervised learning to reinforcement learning, offering a principled understanding of how model complexity and regularization shape TD performance in RL settings. Overall, it provides a rigorous accounting of when overparameterization helps or hurts in TD methods and identifies key determinants of double descent in MSBE and MSVE, with practical implications for selecting regularization and state coverage in RL tasks.

Abstract

Temporal Difference (TD) algorithms are widely used in Deep Reinforcement Learning (RL). Their performance is heavily influenced by the size of the neural network. While in supervised learning, the regime of over-parameterization and its benefits are well understood, the situation in RL is much less clear. In this paper, we present a theoretical analysis of the influence of network size and $l_2$-regularization on performance. We identify the ratio between the number of parameters and the number of visited states as a crucial factor and define over-parameterization as the regime when it is larger than one. Furthermore, we observe a double descent phenomenon, i.e., a sudden drop in performance around the parameter/state ratio of one. Leveraging random features and the lazy training regime, we study the regularized Least-Square Temporal Difference (LSTD) algorithm in an asymptotic regime, as both the number of parameters and states go to infinity, maintaining a constant ratio. We derive deterministic limits of both the empirical and the true Mean-Squared Bellman Error (MSBE) that feature correction terms responsible for the double descent. Correction terms vanish when the $l_2$-regularization is increased or the number of unvisited states goes to zero. Numerical experiments with synthetic and small real-world environments closely match the theoretical predictions.
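The setting described in the abstract can be illustrated with a minimal sketch, which is not the authors' code. It assumes a small synthetic MRP with a dense random transition matrix, ReLU random features over hypothetical 5-dimensional state vectors, the standard regularized LSTD estimate with a $1/n$ normalization, and uniform state weights in the true MSBE (a simplification of a stationary-distribution weighting). All function names and default values are illustrative.

```python
import numpy as np


def regularized_lstd(phi, phi_next, rewards, gamma, lam):
    """Solve (Phi^T (Phi - gamma * Phi') / n + lam * I) theta = Phi^T r / n."""
    n, num_features = phi.shape
    A = phi.T @ (phi - gamma * phi_next) / n + lam * np.eye(num_features)
    b = phi.T @ rewards / n
    return np.linalg.solve(A, b)


def empirical_msbe(theta, phi, phi_next, rewards, gamma):
    """Mean squared Bellman residual over the collected transitions."""
    residual = rewards + gamma * phi_next @ theta - phi @ theta
    return float(np.mean(residual ** 2))


def true_msbe(theta, phi_all, P, r, gamma, weights):
    """Weighted squared Bellman residual over every state of the MRP."""
    v = phi_all @ theta
    residual = r + gamma * P @ v - v
    return float(weights @ residual ** 2)


def run_demo(num_states=40, n=200, gamma=0.95, lam=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic MRP (assumption): dense random transitions, state-based rewards.
    P = rng.random((num_states, num_states))
    P /= P.sum(axis=1, keepdims=True)
    r = rng.standard_normal(num_states)
    state_vecs = rng.standard_normal((num_states, 5))  # hypothetical state descriptors

    # Sample one trajectory of n transitions.
    states = np.empty(n + 1, dtype=int)
    states[0] = rng.integers(num_states)
    for t in range(n):
        states[t + 1] = rng.choice(num_states, p=P[states[t]])
    s, s_next = states[:-1], states[1:]
    m = len(np.unique(states))  # number of distinct visited states

    # Uniform weights are a simplification chosen for this sketch.
    weights = np.full(num_states, 1.0 / num_states)

    results = []
    for N in (max(m // 4, 1), m, 4 * m):  # below, at, and above N / m = 1
        W = rng.standard_normal((N, state_vecs.shape[1]))
        phi_all = np.maximum(state_vecs @ W.T, 0.0)  # ReLU random features
        theta = regularized_lstd(phi_all[s], phi_all[s_next], r[s], gamma, lam)
        results.append((
            N / m,
            empirical_msbe(theta, phi_all[s], phi_all[s_next], r[s], gamma),
            true_msbe(theta, phi_all, P, r, gamma, weights),
        ))
    return results


if __name__ == "__main__":
    for ratio, emp, true in run_demo():
        print(f"N/m = {ratio:.2f}  empirical MSBE = {emp:.3e}  true MSBE = {true:.3e}")
```

Sweeping `N` over a finer grid and averaging over several seeds is the natural way to trace both errors as a function of $N/m$. The values from a single seed are noisy and should not be read as a reproduction of the paper's figures.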

Paper Structure (54 sections, 56 theorems, 420 equations, 14 figures, 1 table)

Introduction
Contributions.
Related Work
Neural Tangent Kernel (NTK) regime.
Mean-Field regime.
Double Asymptotic regime.
Preliminaries
Notations.
Markov Reward Processes.
Linear Function Approximation.
Linear Temporal-Difference Methods.
System Model
Regularized LSTD with Random Features
Random Features.
Sample Matrices and Empirical MSBE.
...and 39 more sections

Key Result

Theorem 5.1

Under the growth-rate assumption (double asymptotic regime) and the bounded-spectrum assumption, let $\lambda > 0$ and let the deterministic resolvent $\bar{\bm{Q}}_m(\lambda) \in \mathbb{R}^{n \times n}$ be [...], where the deterministic Gram feature matrix $\bm{\Phi}_{\hat{\mathcal{S}}} \in \mathbb{R}^{m \times m}$ is [...], and the correction factor $\delta$ is the unique positive solution [...]

Figures (14)

Figure 1: As the model complexity $N/m$ (for $N$ parameters and $m$ distinct visited states) increases, the MSBE first follows a U-shaped curve and then peaks around the interpolation threshold ($N=m$). Double descent refers to the second drop of the MSBE for $N/m>1$. Continuous lines (red) indicate the theoretical values from Theorem 5.3; the crosses (blue) are numerical results averaged over $30$ instances, shown with their standard deviations, after learning with regularized LSTD on Taxi-v3 for $\gamma=0.95, \lambda=10^{-9}, n=5000, m=310$.
Figure 2: The correction factor $\delta$ is a decreasing function of the number of parameters $N$. For a small $l_2$-regularization parameter $\lambda$, we observe a sharp decrease near the interpolation threshold ($N=m$ for $m$ distinct visited states). As $\lambda$ increases, the function becomes smoother and smaller (note the different scales of the y-axis). $\delta$ is computed from its definition in Theorem 5.1 on Taxi-v3 with $\gamma=0.95, m=310, n=5000$.
Figure 3: The double descent phenomenon occurs in the true $\operatorname{MSBE}$ (red) of regularized LSTD, peaking around the interpolation threshold ($N=m$ for $N$ parameters and $m$ distinct visited states), where the empirical $\widehat{\operatorname{MSBE}}$ (blue) vanishes. It diminishes as the $l_2$-regularization parameter $\lambda$ increases. Continuous lines indicate the theoretical values from Theorem 5.2 and Theorem 5.3; the crosses are numerical results averaged over $30$ instances after learning with regularized LSTD in Taxi-v3 with $\gamma=0.95, m=310, n=5000$.
Figure 4: As more distinct states $m$ are visited, the double descent in the MSBE diminishes, disappearing for $m = \lvert \mathcal{S} \rvert$. $\operatorname{MSBE}$ from Theorem 5.3 (lines) and numerical results averaged over $30$ instances (crosses) in a synthetic ergodic MRP for $m=0.86\lvert \mathcal{S} \rvert$ (purple), $m=0.998\lvert \mathcal{S} \rvert$ (maroon), and $m=\lvert \mathcal{S} \rvert$ (green) with $\gamma=0.95, s=\lvert \mathcal{S} \rvert, n=3000$.
Figure 5: The discount factor $\gamma$ has little effect on the double descent in the MSBE. Results in the Gridworld MRP for $\gamma=0$ (purple), $\gamma=0.5$ (maroon), $\gamma=0.95$ (green), and $\gamma=0.99$ (orange) with $m=386, n=5000$. $\operatorname{MSBE}$ from Theorem 5.3 (lines) and numerical results averaged over $30$ instances (crosses).
...and 9 more figures

Theorems & Definitions (117)

Theorem 5.1: Asymptotic Deterministic Resolvent
Remark 1
Remark 2
Remark 3
Theorem 5.2: Asymptotic Empirical MSBE
Remark 4
Remark 5
Theorem 5.3: Asymptotic MSBE
Remark 6
Remark 7
...and 107 more
