
SPQR: Controlling Q-ensemble Independence with Spiked Random Model for Reinforcement Learning

Dohyeok Lee, Seungyub Han, Taehyun Cho, Jungwoo Lee

TL;DR

SPQR tackles overestimation bias in Q-ensembles by enforcing independence among ensemble members through a tractable KL divergence between the Q-ensemble's eigenvalue distribution and Wigner's semicircle distribution. Grounded in random matrix theory via the spiked Wishart model, it provides a universal spectral regularizer that does not assume a particular distribution for the Q-functions. Empirically, SPQR improves online and offline RL performance across MuJoCo, D4RL Gym, Franka Kitchen, and Antmaze while reducing spectral spikes, and it remains computationally efficient and compatible with existing diversification methods. The work pairs a theoretical grounding for ensemble independence with practical gains in ensemble Q-learning.

Abstract

Alleviating overestimation bias is a critical challenge for deep reinforcement learning to achieve successful performance on more complex tasks or offline datasets containing out-of-distribution data. In order to overcome overestimation bias, ensemble methods for Q-learning have been investigated to exploit the diversity of multiple Q-functions. Since network initialization has been the predominant approach to promote diversity in Q-functions, heuristically designed diversity injection methods have been studied in the literature. However, previous studies have not attempted to approach guaranteed independence over an ensemble from a theoretical perspective. By introducing a novel regularization loss for Q-ensemble independence based on random matrix theory, we propose spiked Wishart Q-ensemble independence regularization (SPQR) for reinforcement learning. Specifically, we modify the intractable hypothesis testing criterion for the Q-ensemble independence into a tractable KL divergence between the spectral distribution of the Q-ensemble and the target Wigner's semicircle distribution. We implement SPQR in several online and offline ensemble Q-learning algorithms. In the experiments, SPQR outperforms the baseline algorithms in both online and offline RL benchmarks.
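As a rough illustration of the idea in the abstract, the sketch below estimates a KL divergence between an empirical eigenvalue distribution and Wigner's semicircle density using a histogram. It is not the authors' loss: the function names (`semicircle_pdf`, `spectral_kl`, `symmetrize`), the histogram estimator, the KL direction, the radius-2 normalization, and the way the test matrices are built are all assumptions made for this example. A histogram is also not differentiable, so this is better read as a diagnostic than as a training objective.

```python
import numpy as np

def semicircle_pdf(x, radius=2.0):
    """Wigner semicircle density supported on [-radius, radius]."""
    x = np.asarray(x, dtype=float)
    inside = np.clip(radius**2 - x**2, 0.0, None)
    return 2.0 / (np.pi * radius**2) * np.sqrt(inside)

def spectral_kl(eigenvalues, radius=2.0, bins=30, eps=1e-8):
    """Histogram estimate of KL(empirical spectrum || semicircle).

    The KL direction and the binning are assumptions of this sketch.
    """
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    # Cover both the semicircle support and any outlying eigenvalues.
    lo = min(eigenvalues.min(), -radius)
    hi = max(eigenvalues.max(), radius)
    edges = np.linspace(lo, hi, bins + 1)
    counts, _ = np.histogram(eigenvalues, bins=edges)
    p = counts / counts.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Midpoint approximation of the semicircle mass in each bin,
    # smoothed by eps so that empty target bins do not give log(0).
    q = semicircle_pdf(centers, radius) * np.diff(edges) + eps
    q = q / q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def symmetrize(values):
    """Hypothetical helper: turn an n x n array into a symmetric matrix
    scaled so that i.i.d. unit-variance entries give a radius-2 bulk."""
    n = values.shape[0]
    return (values + values.T) / np.sqrt(2.0 * n)

rng = np.random.default_rng(0)
n = 200

# Independent entries: the spectrum should be close to the semicircle.
independent = rng.standard_normal((n, n))

# Strongly shared signal plus small noise: the bulk collapses and
# a few eigenvalues separate from it.
shared = rng.standard_normal((1, n))
correlated = np.ones((n, 1)) @ shared + 0.1 * rng.standard_normal((n, n))

kl_independent = spectral_kl(np.linalg.eigvalsh(symmetrize(independent)))
kl_correlated = spectral_kl(np.linalg.eigvalsh(symmetrize(correlated)))
print(f"KL (independent entries): {kl_independent:.3f}")
print(f"KL (correlated entries):  {kl_correlated:.3f}")
```

Under these assumptions, the correlated matrix should yield a clearly larger divergence than the independent one, which is the qualitative behavior a spectral independence regularizer relies on.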

Paper Structure (42 sections, 13 theorems, 32 equations, 16 figures, 11 tables, 4 algorithms)


Key Result

Theorem 4.1

Under the definition of the spiked Wishart model, as $N \rightarrow \infty$, the following test $T(\lambda)$ is optimal with probability at least $1-\delta$.

Figures (16)

  • Figure 1: Left 1--4: Histograms of the Q-value distribution over the networks in the ensemble. Each plot shows the Q-value histogram for one state-action pair. The X-axis is the Q-value of each Q-network and the Y-axis is the number of Q-networks in each histogram bin. The red horizontal line represents a uniform distribution and the blue solid line represents the kernel density estimate of the histogram. Rightmost: Heatmap of the Pearson correlation coefficient matrix between the Q-networks in the ensemble. Detailed values and explanations are given in the appendix on further experiments.
  • Figure 2: Eigenvalue plot for the spiked Wishart model with Wigner's semicircle law. For effective visualization of the spiked model, we use a complex Hermitian random matrix from the Gaussian Unitary Ensemble (GUE). Each blue dot represents a data point in the complex plane. The red line represents Wigner's semicircle distribution. The blue line in the histogram represents the kernel density estimate. Left 1--2: Perturbation power is $\psi=0$. The eigenvalue distribution follows Wigner's semicircle law. Right 3--4: Perturbation power is $\psi=10^{-5}$. The eigenvalue distribution almost follows Wigner's semicircle law except for the largest eigenvalue, which is interpreted as a spike.
  • Figure 3: Mean of predicted Q-value of SPQR-SAC-Min with various $\beta$ on hopper-random dataset, averaged over 4 seeds.
  • Figure 4: Illustrative example of SPQR.
  • Figure 5: Illustration for building Q-matrix.
  • ...and 11 more figures
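The spike described in the Figure 2 caption can be reproduced qualitatively with a small simulation. The sketch below is a simplification, not the paper's setup: it uses a real symmetric matrix from the Gaussian Orthogonal Ensemble (Definition 3.1) plus a rank-one perturbation (a spiked Wigner matrix) instead of the GUE construction in the figure, and the function names, the radius-2 normalization, and the perturbation strength `strength=3.0` are assumptions chosen for illustration. The strength is not comparable to the figure's $\psi=10^{-5}$, whose normalization is not given here.

```python
import numpy as np

def sample_goe(n, rng):
    """Real symmetric Gaussian matrix scaled so that the bulk of the
    spectrum follows a semicircle on roughly [-2, 2]."""
    g = rng.standard_normal((n, n))
    return (g + g.T) / np.sqrt(2.0 * n)

def sample_spiked(n, strength, rng):
    """GOE matrix plus a rank-one perturbation along a random unit vector."""
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    return sample_goe(n, rng) + strength * np.outer(v, v)

rng = np.random.default_rng(0)
n = 400
strength = 3.0  # assumed value, well above the detection threshold of 1

# eigvalsh returns eigenvalues in ascending order, so [-1] is the largest.
top_null = np.linalg.eigvalsh(sample_goe(n, rng))[-1]
top_spiked = np.linalg.eigvalsh(sample_spiked(n, strength, rng))[-1]

# For this normalization and strength > 1, the largest eigenvalue of the
# perturbed matrix concentrates near strength + 1/strength as n grows,
# while the unperturbed one stays near the bulk edge at 2.
predicted = strength + 1.0 / strength
print(f"largest eigenvalue, no perturbation:   {top_null:.3f}")
print(f"largest eigenvalue, with perturbation: {top_spiked:.3f}")
print(f"asymptotic prediction for the spike:   {predicted:.3f}")
```

In this toy setting, all but one eigenvalue of the perturbed matrix remain inside the semicircle bulk, which matches the caption's description of a spectrum that follows the semicircle law except for its largest eigenvalue.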

Theorems & Definitions (26)

  • Definition 3.1: Gaussian Orthogonal Ensemble (Mehta, 1991)
  • Definition 3.2
  • Theorem 4.1
  • Definition A.1: Contiguity
  • Lemma A.2: Le Cam's First Lemma
  • Lemma A.3: Second moment method
  • Lemma A.4
  • Lemma A.5: Neyman-Pearson Lemma
  • Lemma A.5
  • Corollary A.6
  • ...and 16 more