Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning
Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, Sergey Levine
TL;DR
This work identifies and analyzes implicit under-parameterization in bootstrapped value-based deep RL: training with gradient descent on bootstrapped targets reduces the effective rank of the Q-network's feature representations, which harms expressivity and performance. The authors analyze the phenomenon theoretically in both kernel-regression and deep-linear-network settings, showing that rank declines across bootstrapping iterations, and corroborate the analysis with extensive experiments on Atari and Gym in offline and online settings. They also propose a rank-preserving penalty, L_p(Φ), which empirically improves offline performance on several games and suggests that rank-aware objectives can mitigate data-efficiency challenges. The penalty helps but does not fully address the root cause, which motivates future work on architectures and auxiliary losses that maintain expressive capacity during bootstrapped learning. Overall, the paper sheds light on optimization-driven limitations of bootstrapped deep RL and offers practical mitigation avenues relevant to offline and data-efficient RL methods.
Abstract
We identify an implicit under-parameterization phenomenon in value-based deep RL methods that use bootstrapping: when value functions, approximated using deep neural networks, are trained with gradient descent using iterated regression onto target values generated by previous instances of the value network, more gradient updates decrease the expressivity of the current value network. We characterize this loss of expressivity via a drop in the rank of the learned value network features, and show that this typically corresponds to a performance drop. We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings. We formally analyze this phenomenon and show that it results from a pathological interaction between bootstrapping and gradient-based optimization. We further show that mitigating implicit under-parameterization by controlling rank collapse can improve performance.
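The rank drop described above can be monitored with a threshold-based effective rank of the feature matrix. The sketch below is only illustrative: it assumes the features are collected as a matrix with one row per state-action sample, and it counts the smallest number of singular values whose sum reaches a (1 - δ) share of the total. The function name `effective_rank`, the default `delta=0.01`, and the synthetic matrices are assumptions made for this example and are not taken from the summary.

```python
import numpy as np


def effective_rank(features, delta=0.01):
    """Smallest k such that the top-k singular values of `features`
    account for at least a (1 - delta) share of their total sum."""
    singular_values = np.linalg.svd(features, compute_uv=False)  # sorted descending
    total = singular_values.sum()
    if total == 0.0:
        return 0  # an all-zero feature matrix has no informative directions
    cumulative_share = np.cumsum(singular_values) / total
    # Index of the first entry that reaches the threshold, converted to a count.
    k = int(np.searchsorted(cumulative_share, 1.0 - delta)) + 1
    # Guard against floating-point round-off pushing k past the number of values.
    return min(k, singular_values.size)


# Synthetic stand-ins for penultimate-layer features (hypothetical data).
rng = np.random.default_rng(0)
spread_features = rng.normal(size=(256, 16))  # variance spread over many directions
collapsed_features = np.outer(rng.normal(size=256), rng.normal(size=16))  # one direction

print("spread features:   ", effective_rank(spread_features))
print("collapsed features:", effective_rank(collapsed_features))
```

Logging this quantity on a fixed batch of samples at regular intervals during training would show whether the feature rank falls as more gradient updates are taken.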
