SPQR: Controlling Q-ensemble Independence with Spiked Random Model for Reinforcement Learning
Dohyeok Lee, Seungyub Han, Taehyun Cho, Jungwoo Lee
TL;DR
SPQR tackles overestimation bias in Q-ensembles by promoting independence among ensemble members through a tractable KL divergence between the ensemble's eigenvalue distribution and the Wigner semicircle distribution. Grounded in random matrix theory via the spiked Wishart model, it provides a universal spectral regularizer that does not assume a particular distribution for the Q-functions. Empirically, SPQR improves online and offline RL performance on MuJoCo, D4RL Gym, Franka Kitchen, and Antmaze while reducing spectral spikes. It remains computationally efficient and is compatible with existing diversification methods. The work thus pairs a theoretical grounding for ensemble independence with practical gains for ensemble Q-learning.
Abstract
Alleviating overestimation bias is a critical challenge for deep reinforcement learning if it is to perform well on more complex tasks or on offline datasets containing out-of-distribution data. To overcome overestimation bias, ensemble methods for Q-learning have been investigated to exploit the diversity of multiple Q-functions. Because network initialization has been the predominant means of promoting diversity among Q-functions, heuristically designed diversity injection methods have also been studied in the literature. However, previous studies have not approached the problem of guaranteeing independence across an ensemble from a theoretical perspective. By introducing a novel regularization loss for Q-ensemble independence based on random matrix theory, we propose spiked Wishart Q-ensemble independence regularization (SPQR) for reinforcement learning. Specifically, we replace the intractable hypothesis testing criterion for Q-ensemble independence with a tractable KL divergence between the spectral distribution of the Q-ensemble and the target Wigner semicircle distribution. We implement SPQR in several online and offline ensemble Q-learning algorithms. In our experiments, SPQR outperforms the baseline algorithms on both online and offline RL benchmarks.
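To make the spectral criterion concrete, the following is a minimal NumPy sketch of a KL divergence between an empirical eigenvalue histogram and the Wigner semicircle distribution. It is an illustration under stated assumptions, not the authors' implementation: the function names (semicircle_cdf, spectral_kl), the square matrix of Q-values, the global standardization, the symmetrization step, and the histogram-based KL estimate are all hypothetical choices made for this example.

```python
import numpy as np


def semicircle_cdf(x, radius=2.0):
    """CDF of the Wigner semicircle distribution supported on [-radius, radius]."""
    x = np.clip(np.asarray(x, dtype=float), -radius, radius)
    return (0.5
            + x * np.sqrt(radius**2 - x**2) / (np.pi * radius**2)
            + np.arcsin(x / radius) / np.pi)


def spectral_kl(q_values, bins=30, span=4.0, eps=1e-6):
    """Histogram estimate of KL(empirical spectrum || semicircle).

    q_values: square (n, n) array, assumed here to hold the outputs of
    n ensemble members on n state-action samples (a hypothetical layout).
    """
    q = np.asarray(q_values, dtype=float)
    if q.ndim != 2 or q.shape[0] != q.shape[1]:
        raise ValueError("q_values must be a square 2-D array")
    n = q.shape[0]

    # Standardize all entries to zero mean and unit variance.
    z = (q - q.mean()) / (q.std() + 1e-12)
    # Symmetrize and scale so that independent entries give a spectrum
    # that approaches the semicircle on [-2, 2] as n grows.
    w = (z + z.T) / np.sqrt(2.0 * n)
    eigvals = np.linalg.eigvalsh(w)

    # Empirical bin masses; outlier eigenvalues are clipped into the end bins.
    edges = np.linspace(-span, span, bins + 1)
    counts, _ = np.histogram(np.clip(eigvals, -span, span), bins=edges)
    p = counts / counts.sum()

    # Exact semicircle mass of each bin, computed from the CDF.
    target = np.diff(semicircle_cdf(edges))

    # Smooth both distributions so the logarithm stays finite.
    p = (p + eps) / (p + eps).sum()
    target = (target + eps) / (target + eps).sum()
    return float(np.sum(p * np.log(p / target)))


# Toy comparison: independent entries versus entries sharing a common signal.
rng = np.random.default_rng(0)
n = 200
independent = rng.standard_normal((n, n))
shared = rng.standard_normal(n)              # one value per sample, seen by all members
correlated = independent + shared[None, :]   # every row carries the same signal

kl_indep = spectral_kl(independent)
kl_corr = spectral_kl(correlated)
print(f"KL (independent): {kl_indep:.4f}")
print(f"KL (correlated):  {kl_corr:.4f}")
```

In this toy setting the shared signal produces outlier eigenvalues (spectral spikes) and a compressed bulk, so the divergence for the correlated matrix should be noticeably larger than for the independent one. A histogram is not differentiable, so a regularizer used during training would need a different estimator; this sketch only illustrates the quantity being compared.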
