Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning

Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, Sergey Levine

TL;DR

This work identifies and analyzes an implicit under-parameterization phenomenon in bootstrapped value-based deep RL, where gradient descent learning on bootstrapped targets reduces the effective rank of the Q-network's feature representations, harming expressivity and performance. The authors provide both kernel-regression and deep-linear-network theoretical analyses showing rank decline across bootstrapping iterations, and corroborate these findings with extensive empirical evidence on Atari and Gym in offline and online settings. They further propose a rank-preserving penalty, L_p(Φ), which empirically improves offline performance across several games and demonstrates the potential of rank-aware objectives to mitigate data-efficiency challenges. While the penalty helps, it does not fully address the root cause, motivating future work on architectures and auxiliary losses that maintain expressive capacity during bootstrapped learning. Overall, the paper sheds light on optimization-driven limitations of bootstrapped deep RL and offers practical mitigation avenues with implications for offline and data-efficient RL methods.
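The rank-preserving penalty mentioned above can be sketched in a few lines. This summary does not spell out the form of $L_p(\Phi)$; the sketch assumes it is the gap between the largest and smallest squared singular values of the feature matrix, $\sigma_{\max}^2(\Phi) - \sigma_{\min}^2(\Phi)$. The function names, the `alpha` weight, and the sample values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def rank_penalty(singular_values: Sequence[float]) -> float:
    """Assumed form of L_p: sigma_max^2 - sigma_min^2 of the feature matrix."""
    sigma = [float(s) for s in singular_values]
    if not sigma:
        raise ValueError("need at least one singular value")
    return max(sigma) ** 2 - min(sigma) ** 2


def regularized_loss(td_loss: float,
                     singular_values: Sequence[float],
                     alpha: float) -> float:
    """TD loss plus the weighted penalty; alpha is a hypothetical coefficient."""
    return td_loss + alpha * rank_penalty(singular_values)


# Evenly spread singular values incur no penalty under this assumed form.
balanced_penalty = rank_penalty([1.0, 1.0, 1.0, 1.0])
# A dominant direction is penalized: 3^2 - 1^2.
skewed_penalty = rank_penalty([3.0, 1.0])
total = regularized_loss(td_loss=2.0, singular_values=[3.0, 1.0], alpha=0.5)
```

In practice the singular values would come from an SVD of a minibatch of penultimate-layer features, and the penalty would be differentiated through by the training framework; the sketch only shows the quantity being penalized.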

Abstract

We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity via a drop in the rank of the learned value network features, and show that this typically corresponds to a performance drop. We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings. We formally analyze this phenomenon and show that it results from a pathological interaction between bootstrapping and gradient-based optimization. We further show that mitigating implicit under-parameterization by controlling rank collapse can improve performance.
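To make the "drop in rank" concrete, here is a minimal sketch of an effective-rank computation. It assumes the threshold-based definition $\text{srank}_\delta(\Phi) = \min\{k : \sum_{i \le k} \sigma_i(\Phi) / \sum_i \sigma_i(\Phi) \ge 1 - \delta\}$, which this summary lists only as Definition 1 without stating it. The default `delta = 0.01` and the sample singular values are illustrative assumptions.

```python
from typing import Sequence


def srank(singular_values: Sequence[float], delta: float = 0.01) -> int:
    """Smallest k whose top-k singular values hold a (1 - delta) share of the total."""
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must lie in [0, 1)")
    # Sort in descending order so the largest directions are counted first.
    sigma = sorted((float(s) for s in singular_values), reverse=True)
    total = sum(sigma)
    if total <= 0.0:
        return 0  # all-zero features have no effective rank
    running = 0.0
    for k, s in enumerate(sigma, start=1):
        running += s
        if running >= (1.0 - delta) * total:
            return k
    return len(sigma)


# Hypothetical spectra: evenly spread features vs. one dominant direction.
healthy = [1.0, 1.0, 1.0, 1.0]
collapsed = [10.0, 0.05, 0.03, 0.02]
healthy_rank = srank(healthy)      # all four directions are needed
collapsed_rank = srank(collapsed)  # a single direction already covers 99%
```

The singular values of a feature matrix `phi` could be obtained with, for example, `numpy.linalg.svd(phi, compute_uv=False)`; tracking `srank` over training is one way to observe the collapse described in the abstract.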

Paper Structure

This paper contains 33 sections, 11 theorems, 72 equations, 31 figures, 1 table, 1 algorithm.

Key Result

Theorem 4.1

Let $\mathbf{S}$ be shorthand for $\gamma P^\pi \mathbf{A}$ and assume $\mathbf{S}$ is a normal matrix. Then there exists an infinite, strictly increasing sequence of fitting iterations, $(k_l)_{l=1}^\infty$ starting from $k_1 = 0$, such that, for any two singular values […] Hence, $\text{srank}_{\delta}(\mathbf{M}_{k_{l^\prime}}) \leq \text{srank}_\delta(\mathbf{M}_{k_l})$

Figures (31)

  • Figure 1: Implicit under-parameterization. Schematic diagram depicting the emergence of an effective rank collapse in deep Q-learning. Minimizing TD errors using gradient descent with a deep neural network Q-function leads to a collapse in the effective rank of the learned features $\Phi$, which is exacerbated by further training.
  • Figure 2: Offline RL. $\text{srank}_\delta(\Phi)$ and performance of neural FQI on gridworld, DQN on Atari and SAC on Gym environments in the offline RL setting. Note that low rank (top row) generally corresponds to worse policy performance (bottom row). Rank collapse is worse with more gradient steps per fitting iteration ($T = 10$ vs. $200$ on gridworld). Rank collapse occurs even when a larger, high-coverage dataset is used, marked as DQN (4x data) (for Asterix, see also Figure \ref{fig:offline_problem_app_20k} for a complete figure with a larger number of gradient updates).
  • Figure 3: Data Efficient Online RL. $\text{srank}_\delta(\Phi)$ and performance of neural FQI on gridworld, DQN on Atari and SAC on Gym domains in the online RL setting, with varying numbers of gradient steps per environment step ($n$). Rank collapse happens earlier with more gradient steps, and the corresponding performance is poor.
  • Figure 4: (a) Fitting error for $Q^*$ prediction for $n = 10$ vs. $n = 200$ steps in Figure \ref{fig:online_problem} (left). Observe that rank collapse inhibits fitting $Q^*$, as the fitting error rises over training while rank collapses. (b) TD error for varying values of $n$ for Seaquest in Figure \ref{fig:online_problem} (middle). TD error increases with rank degradation. (c) Q-network re-initialization in each fitting iteration on gridworld. (d) Trend of $\text{srank}_\delta(\Phi)$ for policy evaluation based on bootstrapped updates (FQE) vs. Monte-Carlo returns (no bootstrapping). Note that rank collapse still persists with reinitialization and FQE, but goes away in the absence of bootstrapping.
  • Figure 5: Trend of $\text{srank}_\delta(\Phi)$ vs. error (log scale) to the projected TD fixed point. A drop in $\text{srank}_\delta(\Phi)$ (shown as blue and yellow circles) corresponds to an increase in distance to the fixed point.
  • ...and 26 more figures

Theorems & Definitions (17)

  • Definition 1
  • Theorem 4.1
  • Proposition 4.1
  • Proposition 4.2
  • Theorem 4.2
  • Lemma C.0.1
  • proof
  • Theorem C.1
  • Lemma C.1.1: $\text{srank}_\delta(\mathbf{M}_k)$ decreases when $\mathbf{S}^k$ is PSD.
  • proof
  • ...and 7 more