Diverse Randomized Value Functions: A Provably Pessimistic Approach for Offline Reinforcement Learning

Xudong Yu; Chenjia Bai; Hongyi Guo; Changhong Wang; Zhen Wang

TL;DR

This paper tackles distributional shift in offline RL by learning reliable uncertainty estimates over Q-values with a small number of ensemble members. It introduces Diverse Randomized Value Functions (DRVF), which combine Bayesian last-layer neural networks with ensembling and a repulsive regularization term to approximate the Q-posterior and produce an LCB penalty that is provably pessimistic under linear MDP assumptions. The approach yields competitive or superior performance on D4RL benchmarks with markedly better parameter efficiency, and its uncertainty estimates are higher for OOD actions than for in-distribution ones. Theoretical results connect DRVF posterior sampling to efficient pessimism in linear settings, and empirical evidence shows that DRVF is practical for offline policy learning at reduced computational cost. Overall, DRVF offers a principled, scalable framework for uncertainty-aware offline RL that mitigates extrapolation errors while using far fewer ensemble members than prior uncertainty-based methods.

Abstract

Offline Reinforcement Learning (RL) faces distributional shift and unreliable value estimation, especially for out-of-distribution (OOD) actions. To address this, existing uncertainty-based methods penalize the value function with uncertainty quantification, but they demand numerous ensemble networks, which raises computational cost and can lead to suboptimal outcomes. In this paper, we introduce a novel strategy employing diverse randomized value functions to estimate the posterior distribution of $Q$-values. It provides robust uncertainty quantification and estimates lower confidence bounds (LCB) of $Q$-values. By applying moderate value penalties for OOD actions, our method yields a provably pessimistic approach. We also emphasize diversity within randomized value functions and enhance efficiency by introducing a diversity regularization method, reducing the requisite number of networks. These modules lead to reliable value estimation and efficient policy learning from offline data. Theoretical analysis shows that our method recovers the provably efficient LCB-penalty under linear MDP assumptions. Extensive empirical results also demonstrate that our proposed method significantly outperforms baseline methods in terms of performance and parametric efficiency.
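As a rough illustration of the LCB idea described in the abstract, the sketch below turns a set of sampled $Q$-values into a pessimistic estimate by subtracting a multiple of their standard deviation. The function name `lcb_from_samples`, the coefficient `beta`, and the sample values are all hypothetical choices made for this example; they are not taken from the paper.

```python
import statistics

def lcb_from_samples(q_samples, beta=1.0):
    """Pessimistic value: mean of the sampled Q-values minus beta times their std.

    `beta` is a hypothetical penalty coefficient chosen for illustration.
    """
    mean = statistics.fmean(q_samples)
    std = statistics.pstdev(q_samples)  # population std of the samples
    return mean - beta * std

# Made-up Q-samples: both sets share the same mean, but the samples for the
# OOD action disagree much more, so its LCB is pushed lower.
in_dist_samples = [10.0, 10.2, 9.8, 10.1, 9.9]
ood_samples = [10.0, 14.0, 6.0, 12.0, 8.0]

lcb_in = lcb_from_samples(in_dist_samples)
lcb_ood = lcb_from_samples(ood_samples)
```

With equal means, the action whose samples disagree more receives the larger penalty, which is the behavior an uncertainty-based penalty is meant to have for OOD actions.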

Paper Structure (41 sections, 9 theorems, 37 equations, 10 figures, 8 tables, 1 algorithm)

Introduction
Preliminaries
Episodic MDP
Linear MDP
Related Work
Offline RL
Uncertainty Quantification
Bayesian Uncertainty from Randomized Value Functions
Approximation of the Q-posterior
The ELBO learning objective
LCB penalty from the ensemble BNNs
Pessimistic Q-learning
Repulsive Regularization for Diversity
Intuition Behind the Repulsive Term
Repulsive term for ensemble BNNs
...and 26 more sections

Key Result

Theorem 1

Under linear MDP assumptions, the standard deviation of the estimated posterior distribution $\mathbb P(\tilde{Q} \,|\, s, a, {\mathcal{D}}_m)$ satisfies

$$\mathrm{Std}\big(\tilde{Q}_t(s_t, a_t)\big) = \big[\psi(s_t, a_t)^\top \Lambda_t^{-1} \psi(s_t, a_t)\big]^{1/2} := \Gamma_t^{\rm lcb}(s_t,a_t),$$

where $\Lambda_t = \sum\nolimits_{i \in [m]} \psi(s^i_t, a^i_t) \psi(s^i_t, a^i_t)^\top + \lambda \cdot \mathbf I$, and $\Gamma_t^{\rm lcb}(s_t,a_t)$ is defined as the LCB-term.
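To make the role of $\Lambda_t$ concrete, the sketch below evaluates an LCB-term of the form $[\psi^\top \Lambda_t^{-1} \psi]^{1/2}$, which is the form commonly used in the linear-MDP literature and is assumed here (any scaling constant is omitted). The two-dimensional features, the toy dataset, and the name `lcb_term` are hypothetical and chosen only so the arithmetic can be checked by hand.

```python
import math

def lcb_term(features, query, lam=1.0):
    """sqrt(psi^T Lambda^{-1} psi) for 2-D features (illustrative only).

    Lambda = sum_i psi_i psi_i^T + lam * I, built from the dataset features.
    """
    # Entries of the symmetric 2x2 matrix Lambda = [[a, b], [b, d]].
    a = lam + sum(x * x for x, _ in features)
    b = sum(x * y for x, y in features)
    d = lam + sum(y * y for _, y in features)
    det = a * d - b * b
    # Lambda^{-1} = (1 / det) * [[d, -b], [-b, a]], so the quadratic form is:
    qx, qy = query
    quad = (d * qx * qx - 2.0 * b * qx * qy + a * qy * qy) / det
    return math.sqrt(quad)

# Toy dataset: 99 samples that all lie along the first feature direction.
data = [(1.0, 0.0)] * 99

covered = lcb_term(data, (1.0, 0.0))    # direction well covered by the data
uncovered = lcb_term(data, (0.0, 1.0))  # direction never seen in the data
```

In this toy case $\Lambda_t = \mathrm{diag}(100, 1)$, so the covered direction gets a term of $0.1$ and the unseen direction gets $1.0$: the penalty shrinks as data accumulates along a feature direction and stays large where the dataset provides no coverage.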

Figures (10)

Figure 1: The architecture of DRVF, where we utilize approximate Bayesian inference in the last layer of the critic network. We perform OOD sampling to obtain $(s,a^{\rm ood})$ pairs based on $(s,a)\sim {\mathcal{D}}_m$. We input $(s,a)$ and $(s,a^{\rm ood})$ pairs to obtain the $Q$-samples and the repulsive term, respectively. Then $(s,a)$ is used for the pessimistic update of the $Q$-values, and $(s,a^{\rm ood})$ provides regularization.
Figure 2: Visualization for the intuition behind the repulsive regularization term. Blank dots represent data points, while colorful lines denote possible estimates of the Q-function. The shaded area denotes the uncertainty of ensemble predictions. In panel (a), uncertainty quantification and estimates are obtained from a large number of ensembles, which provides reliable uncertainty estimation. In panel (b), the uncertainty and possible estimates come from a small number of ensembles, which fails to properly account for the uncertainty. In DRVF, we explicitly maximize the standard deviation of the samples from the ensemble BNNs to obtain reliable uncertainty quantification with fewer ensembles.
Figure 3: Uncertainty estimation of the in-distribution samples (white) and OOD samples (orange). The results (a-c) are obtained from DRVF, while the results (d-f) are estimated by PBRL. The brighter area in the contour indicates higher uncertainty, while the darker area indicates lower uncertainty. DRVF demonstrates a more favorable alignment of the brighter areas with the OOD samples, suggesting an enhanced ability to quantify uncertainty.
Figure 4: Minimum number of $Q$-ensembles ($M$) required to obtain the performance in Table 1. DRVF needs far fewer ensembles than EDAC in most cases and reduces the required parameters.
Figure 5: Aggregate metrics on D4RL with 95% CIs based on Gym Mujoco tasks and 5 random seeds for each task. DRVF shows a higher median, mean, and IQM, and a lower optimality gap than other methods.
...and 5 more figures
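The captions for Figures 1 and 2 describe maximizing the standard deviation of the $Q$-samples at OOD state-action pairs. A minimal sketch of that idea, written as a loss to be minimized, is shown below. The name `repulsive_loss`, the use of a plain average over OOD pairs, and the sample values are assumptions made for illustration; the paper's actual regularizer may differ in form and weighting.

```python
import statistics

def repulsive_loss(q_samples_per_ood_pair):
    """Negative average std of Q-samples over OOD pairs (illustrative sketch).

    Each inner list holds the Q-samples for one (s, a_ood) pair. Minimizing
    this value is equivalent to maximizing the spread of the samples.
    """
    stds = [statistics.pstdev(samples) for samples in q_samples_per_ood_pair]
    return -statistics.fmean(stds)

# Made-up Q-samples for two OOD pairs.
collapsed = [[5.0, 5.0, 5.0], [3.0, 3.0, 3.0]]  # samples agree: no diversity
diverse = [[4.0, 5.0, 6.0], [2.0, 3.0, 4.0]]    # samples disagree: diverse

loss_collapsed = repulsive_loss(collapsed)
loss_diverse = repulsive_loss(diverse)
```

The diverse set receives the lower loss, so gradient descent on such a term would push the samples apart on OOD inputs, which is the effect panel (b) of Figure 2 suggests is missing when only a few ensemble members are used.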

Theorems & Definitions (16)

Theorem 1
Definition 1: $\xi$-Uncertainty Quantifier [pevi-2021]
Proposition 1
Corollary 1: Suboptimality Gap [pevi-2021]
Proposition 2 (informal)
Theorem 1 (restated)
Proof
Definition 1 (restated)
Proposition 1 (restated)
Proof
...and 6 more
