Bayesian Conservative Policy Optimization (BCPO): A Novel Uncertainty-Calibrated Offline Reinforcement Learning with Credible Lower Bounds

Debashis Chatterjee

Abstract

Offline reinforcement learning (RL) aims to learn decision policies from a fixed batch of logged transitions, without additional environment interaction. Despite remarkable empirical progress, offline RL remains fragile under distribution shift: value-based methods can overestimate the value of unseen actions, yielding policies that exploit model errors rather than genuine long-term rewards. We propose \emph{Bayesian Conservative Policy Optimization (BCPO)}, a unified framework that converts epistemic uncertainty into \emph{provably conservative} policy improvement. BCPO maintains a hierarchical Bayesian posterior over environment/value models, constructs a \emph{credible lower bound} (LCB) on action values, and performs policy updates under explicit KL regularization toward the behavior distribution. This yields an uncertainty-calibrated analogue of conservative policy iteration in the offline regime. We provide a finite-MDP theory showing that the pessimistic fixed point lower-bounds the true value function with high probability and that KL-controlled updates improve a computable return lower bound. Empirically, we verify the methodology on a real offline replay dataset for the CartPole benchmark obtained via the \texttt{d3rlpy} ecosystem, and report diagnostics that link uncertainty growth and policy drift to offline instability, motivating principled early stopping and calibration.
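
To make the two ingredients named in the abstract concrete, the following is a minimal single-state sketch, not the paper's algorithm. It assumes the LCB takes the common form "posterior mean minus a multiple of the posterior standard deviation" and uses the standard closed-form maximizer of a KL-regularized objective. The function names and the parameters `kappa` and `temperature` are hypothetical, introduced only for illustration.

```python
import math


def credible_lower_bound(q_mean, q_std, kappa):
    """Assumed LCB form: posterior mean minus kappa posterior std, per action."""
    return [m - kappa * s for m, s in zip(q_mean, q_std)]


def kl_regularized_policy(q_lcb, behavior_probs, temperature):
    """Maximizer of sum_a pi(a) * q_lcb[a] - temperature * KL(pi || behavior).

    The solution is pi(a) proportional to behavior(a) * exp(q_lcb[a] / temperature),
    so actions outside the behavior support keep probability zero.
    """
    # Shift by the largest supported value for numerical stability.
    shift = max(q for q, b in zip(q_lcb, behavior_probs) if b > 0)
    weights = [
        b * math.exp((q - shift) / temperature) if b > 0 else 0.0
        for q, b in zip(q_lcb, behavior_probs)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Action 1 has the higher posterior mean but much larger uncertainty,
# so its lower bound falls below that of action 0.
q_lcb = credible_lower_bound(q_mean=[1.0, 2.0], q_std=[0.1, 2.0], kappa=1.0)
policy = kl_regularized_policy(q_lcb, behavior_probs=[0.5, 0.5], temperature=1.0)
```

In this sketch a larger `temperature` keeps the updated policy closer to the behavior distribution, while a larger `kappa` penalizes poorly covered actions more heavily.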

Paper Structure (85 sections, 14 theorems, 85 equations, 9 figures, 2 tables)

Introduction
Research objective.
Our proposal: BCPO.
Novelty and contributions
Organization.
Related Work
Offline RL: constraining distribution shift and value extrapolation
Action-space constraint and support matching.
Behavior-regularized actor--critic.
Value regularization and pessimism.
Model-based offline RL and uncertainty-aware pessimism
Safe policy improvement from batch data
Bayesian/uncertainty-aware RL and ensemble uncertainty
Summary of positioning.
Problem Setup and Notation
...and 70 more sections

Key Result

Lemma 6.1

Fix $(s,a)$ with $n(s,a)\ge 1$ and assume $r\in[0,1]$. Then with probability at least $1-\delta_{r}(s,a)$ (over the reward draws in $\mathcal{D}$ conditional on the visited $(s,a)$),
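
The displayed inequality of the lemma is not reproduced in this overview. As a hedged illustration of the kind of bound the lemma's title refers to, the sketch below computes the standard one-sided Hoeffding lower bound for the mean of i.i.d. rewards in $[0,1]$: with probability at least $1-\delta$, the true mean is at least the empirical mean minus $\sqrt{\log(1/\delta)/(2n)}$. The function name is hypothetical, and the paper's exact bonus (constants, how $\delta_{r}(s,a)$ is allocated) may differ.

```python
import math


def hoeffding_reward_lcb(rewards, delta):
    """One-sided Hoeffding lower bound on the mean of rewards in [0, 1].

    Returns (lcb, bonus), where bonus = sqrt(log(1/delta) / (2 n)) and
    lcb = max(0, empirical_mean - bonus). Clipping at 0 is valid because
    rewards are assumed nonnegative.
    """
    n = len(rewards)
    if n < 1:
        raise ValueError("need at least one observed reward, i.e. n(s,a) >= 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if any(r < 0.0 or r > 1.0 for r in rewards):
        raise ValueError("rewards are assumed to lie in [0, 1]")
    empirical_mean = sum(rewards) / n
    bonus = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return max(0.0, empirical_mean - bonus), bonus


# The bonus shrinks at rate 1/sqrt(n) as the pair (s, a) is visited more often.
lcb_small, bonus_small = hoeffding_reward_lcb([1.0] * 8, delta=0.05)
lcb_large, bonus_large = hoeffding_reward_lcb([1.0] * 800, delta=0.05)
```

Under these assumptions, multiplying the visit count by 100 shrinks the bonus by a factor of 10, which is the coverage-versus-uncertainty relationship described in the caption of Figure 2.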

Figures (9)

Figure 1: Learning curves of offline RL algorithms. BCPO consistently achieves significantly higher expected return compared with both Behavior Cloning and naive FQI.
Figure 2: Relationship between dataset coverage and posterior uncertainty. As the number of observed transitions increases, the uncertainty bound decreases, validating the theoretical motivation for pessimistic evaluation.
Figure 3: State-value heatmaps learned by BCPO (left) and naive FQI (right). BCPO learns smooth value gradients toward the goal, whereas naive FQI produces unstable and localized value spikes due to extrapolation errors.
Figure 4: Greedy action maps derived from learned policies. BCPO generates a coherent navigation strategy toward the goal, while naive FQI produces inconsistent directional choices caused by inaccurate value estimates.
Figure 5: Learning curves on cartpole-replay. The BC line indicates near-ceiling performance. Naive offline DQN shows severe non-monotonicity and collapse after initially reaching high returns. BCPO initially reaches near-ceiling return but later deteriorates, motivating improved conservatism and/or early-stopping criteria.
...and 4 more figures

Theorems & Definitions (35)

Definition 5.1: Hierarchical prior
Definition 5.2: Posterior mean and variance of the critic
Definition 5.3: Lower credible bound of $Q$
Definition 6.1: Empirical and posterior-mean transitions
Definition 6.2: LCB bonuses
Definition 6.3: Pessimistic Bellman operator
Remark 6.1
Lemma 6.1: Reward LCB via Hoeffding
proof
Lemma 6.2: $\ell_1$-concentration of Dirichlet posterior mean
...and 25 more
