Bayesian Conservative Policy Optimization (BCPO): A Novel Uncertainty-Calibrated Offline Reinforcement Learning with Credible Lower Bounds

Debashis Chatterjee

Abstract

Offline reinforcement learning (RL) aims to learn decision policies from a fixed batch of logged transitions, without additional environment interaction. Despite remarkable empirical progress, offline RL remains fragile under distribution shift: value-based methods can overestimate the value of unseen actions, yielding policies that exploit model errors rather than genuine long-term rewards. We propose \emph{Bayesian Conservative Policy Optimization (BCPO)}, a unified framework that converts epistemic uncertainty into \emph{provably conservative} policy improvement. BCPO maintains a hierarchical Bayesian posterior over environment/value models, constructs a \emph{credible lower bound} (LCB) on action values, and performs policy updates under explicit KL regularization toward the behavior distribution. This yields an uncertainty-calibrated analogue of conservative policy iteration in the offline regime. We provide a finite-MDP theory showing that the pessimistic fixed point lower-bounds the true value function with high probability and that KL-controlled updates improve a computable return lower bound. Empirically, we verify the methodology on a real offline replay dataset for the CartPole benchmark obtained via the \texttt{d3rlpy} ecosystem, and report diagnostics that link uncertainty growth and policy drift to offline instability, motivating principled early stopping and calibration.
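The two ingredients named in the abstract, a lower bound on action values and a KL-regularized update toward the behavior distribution, can be illustrated for a single state with a small sketch. It is not the paper's implementation. It assumes the LCB takes the common form "posterior mean minus $\kappa$ times posterior standard deviation" over a set of sampled critics, and it uses the standard closed-form solution $\pi(a) \propto \pi_\beta(a)\exp(Q_{\mathrm{LCB}}(a)/\tau)$ of a KL-penalized objective. The names `lcb_from_samples`, `kl_regularized_policy`, `kappa`, and `temperature` are hypothetical.

```python
import numpy as np


def lcb_from_samples(q_samples, kappa=1.0):
    """Assumed LCB form: posterior mean minus kappa * posterior std.

    q_samples has shape (n_models, n_actions): one row per sampled critic.
    """
    mean = q_samples.mean(axis=0)
    std = q_samples.std(axis=0)
    return mean - kappa * std


def kl_regularized_policy(q_lcb, behavior_probs, temperature=1.0):
    """Maximizer of <pi, q_lcb> - temperature * KL(pi || behavior).

    The closed form is pi(a) proportional to behavior(a) * exp(q_lcb(a) / temperature).
    Assumes behavior_probs is strictly positive on every action.
    """
    logits = np.log(behavior_probs) + q_lcb / temperature
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


# Illustrative numbers only: action 1 has the higher mean value, but the
# sampled critics disagree strongly about it.
q_samples = np.array([[1.0, 2.0],
                      [1.0, 0.0],
                      [1.0, 4.0]])
behavior = np.array([0.5, 0.5])

q_lcb = lcb_from_samples(q_samples, kappa=2.0)
policy = kl_regularized_policy(q_lcb, behavior, temperature=1.0)
print(q_lcb, policy)
```

In this toy case a greedy update on the posterior mean would pick action 1, while the LCB-based update shifts probability toward action 0, whose value the critics agree on. A larger `temperature` keeps the policy closer to the behavior distribution.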

Paper Structure (85 sections, 14 theorems, 85 equations, 9 figures, 2 tables)

Key Result

Lemma 6.1

Fix $(s,a)$ with $n(s,a)\ge 1$ and assume $r\in[0,1]$. Then with probability at least $1-\delta_{r}(s,a)$ (over the reward draws in $\mathcal{D}$ conditional on the visited $(s,a)$),
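The displayed inequality of the lemma is not reproduced in this excerpt. For orientation only, a one-sided Hoeffding bound for $n$ independent rewards in $[0,1]$ with empirical mean $\hat r$ gives $r \ge \hat r - \sqrt{\log(1/\delta)/(2n)}$ with probability at least $1-\delta$. The sketch below computes that quantity. The one-sided form, the clipping at zero, and the function name `hoeffding_reward_lcb` are assumptions, and the paper's exact constants may differ.

```python
import math


def hoeffding_reward_lcb(rewards, delta):
    """One-sided Hoeffding lower bound on the mean of rewards in [0, 1].

    Returns (lcb, bonus) where bonus = sqrt(log(1 / delta) / (2 n)) and
    lcb = max(0, empirical_mean - bonus). Assumed form, not the paper's statement.
    """
    n = len(rewards)
    if n < 1:
        raise ValueError("need at least one observed reward, i.e. n(s, a) >= 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    mean = sum(rewards) / n
    bonus = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    # Rewards are nonnegative, so clipping at zero never loosens validity.
    return max(0.0, mean - bonus), bonus


# Illustrative data: the same empirical mean with 5 versus 50 observations.
lcb_small, bonus_small = hoeffding_reward_lcb([1.0] * 5, delta=0.05)
lcb_large, bonus_large = hoeffding_reward_lcb([1.0] * 50, delta=0.05)
print(lcb_small, bonus_small)
print(lcb_large, bonus_large)
```

The bonus shrinks at rate $1/\sqrt{n}$, so well-covered state-action pairs are penalized less than rarely visited ones.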

Figures (9)

  • Figure 1: Learning curves of offline RL algorithms. BCPO consistently achieves significantly higher expected return compared with both Behavior Cloning and naive FQI.
  • Figure 2: Relationship between dataset coverage and posterior uncertainty. As the number of observed transitions increases, the uncertainty bound decreases, validating the theoretical motivation for pessimistic evaluation.
  • Figure 3: State-value heatmaps learned by BCPO (left) and naive FQI (right). BCPO learns smooth value gradients toward the goal, whereas naive FQI produces unstable and localized value spikes due to extrapolation errors.
  • Figure 4: Greedy action maps derived from learned policies. BCPO generates a coherent navigation strategy toward the goal, while naive FQI produces inconsistent directional choices caused by inaccurate value estimates.
  • Figure 5: Learning curves on cartpole-replay. The BC line indicates near-ceiling performance. Naive offline DQN shows severe non-monotonicity and collapse after initially reaching high returns. BCPO initially reaches near-ceiling return but later deteriorates, motivating improved conservatism and/or early-stopping criteria.
  • ...and 4 more figures

Theorems & Definitions (35)

  • Definition 5.1: Hierarchical prior
  • Definition 5.2: Posterior mean and variance of the critic
  • Definition 5.3: Lower credible bound of $Q$
  • Definition 6.1: Empirical and posterior-mean transitions
  • Definition 6.2: LCB bonuses
  • Definition 6.3: Pessimistic Bellman operator
  • Remark 6.1
  • Lemma 6.1: Reward LCB via Hoeffding
  • proof
  • Lemma 6.2: $\ell_1$-concentration of Dirichlet posterior mean
  • ...and 25 more
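The Section 6 items listed above (posterior-mean transitions, LCB bonuses, a pessimistic Bellman operator) suggest a tabular construction. The sketch below shows one way such pieces might fit together in a finite MDP with rewards in $[0,1]$; it is not the paper's definition. It assumes a symmetric Dirichlet prior with concentration `alpha`, a single combined bonus `beta / sqrt(n(s,a))`, clipping to $[0, 1/(1-\gamma)]$, and a value of zero for unvisited pairs. All names and constants are hypothetical.

```python
import numpy as np


def pessimistic_value_iteration(counts, reward_sums, gamma=0.9,
                                alpha=1.0, beta=1.0, iters=200):
    """Iterate an assumed pessimistic Bellman backup on count data.

    counts:      (S, A, S) array of observed transition counts.
    reward_sums: (S, A) array of summed rewards, each reward in [0, 1].
    """
    n_states = counts.shape[0]
    n = counts.sum(axis=2)                      # visit counts n(s, a)
    visited = n > 0
    safe_n = np.maximum(n, 1)                   # avoid division by zero
    r_hat = reward_sums / safe_n                # empirical mean reward
    # Posterior-mean transitions under a symmetric Dirichlet(alpha) prior.
    p_bar = (counts + alpha) / (n[..., None] + alpha * n_states)
    bonus = beta / np.sqrt(safe_n)              # assumed combined LCB bonus
    v_max = 1.0 / (1.0 - gamma)

    q = np.zeros(n.shape)
    for _ in range(iters):
        v = q.max(axis=1)
        target = r_hat - bonus + gamma * (p_bar @ v)
        # Unvisited pairs get the most pessimistic value, zero.
        q = np.where(visited, np.clip(target, 0.0, v_max), 0.0)
    return q


# Illustrative dataset: in state 0, action 0 was tried 100 times (mean
# reward 0.8) and action 1 once (reward 1.0). State 1 was never visited.
counts = np.zeros((2, 2, 2))
reward_sums = np.zeros((2, 2))
counts[0, 0, 0], reward_sums[0, 0] = 100, 80.0
counts[0, 1, 0], reward_sums[0, 1] = 1, 1.0

q = pessimistic_value_iteration(counts, reward_sums)
print(q)
```

On this toy dataset the empirical mean reward favors action 1, but its single observation carries a large bonus, so the pessimistic values favor the well-covered action 0.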