Long-Horizon Model-Based Offline Reinforcement Learning Without Conservatism

Tianwei Ni, Esther Derman, Vineet Jain, Vincent Taboga, Siamak Ravanbakhsh, Pierre-Luc Bacon

TL;DR

The paper challenges the default reliance on conservatism in offline RL and proposes Neubay, a Bayesian offline RL method that models epistemic uncertainty via a world-model posterior and trains history-dependent agents for test-time generalization. It tackles compounding errors, overestimation, and long-horizon stability through ensembling, LayerNorm in the world model, adaptive long-horizon planning with uncertainty-based rollout truncation, and stable recurrent training. Empirically, Neubay matches or surpasses conservative baselines on D4RL and NeoRL, achieving new state-of-the-art results on 7 datasets and demonstrating meaningful gains on low- to moderate-quality data, while characterizing when Bayesianism is preferable to conservatism. The work lays a foundation for a Bayesian, non-conservative direction in offline and model-based RL and points to future improvements in world modeling and uncertainty quantification to broaden applicability.

Abstract

Popular offline reinforcement learning (RL) methods rely on conservatism, either by penalizing out-of-dataset actions or by restricting planning horizons. In this work, we question the universality of this principle and instead revisit a complementary one: a Bayesian perspective. Rather than enforcing conservatism, the Bayesian approach tackles epistemic uncertainty in offline data by modeling a posterior distribution over plausible world models and training a history-dependent agent to maximize expected rewards, enabling test-time generalization. We first illustrate, in a bandit setting, that Bayesianism excels on low-quality datasets where conservatism fails. We then scale the principle to realistic tasks, identifying key design choices, such as layer normalization in the world model and adaptive long-horizon planning, that mitigate compounding error and value overestimation. These yield our practical algorithm, Neubay, grounded in the neutral Bayesian principle. On D4RL and NeoRL benchmarks, Neubay generally matches or surpasses leading conservative algorithms, achieving new state-of-the-art on 7 datasets. Notably, it succeeds with planning horizons of several hundred steps, challenging common belief. Finally, we characterize when Neubay is preferable to conservatism, laying the foundation for a new direction in offline and model-based RL.

Paper Structure

This paper contains 40 sections, 7 theorems, 90 equations, 16 figures, 13 tables, 2 algorithms.

Key Result

Proposition 1

If the pessimistic update in eq:model-free-pess converges to a fixed point, then the induced uncertainty set for the corresponding robust MDP is where $m_\mathcal{D}$ is the empirical model (eq:mle) and $s_{\text{absorb}} \not \in \mathcal{D}$ is an artificial absorbing state.

Figures (16)

  • Figure 1: Our algorithm Neubay's result on a D4RL dataset. From left to right: normalized score on the real environment, estimated Q-value on the offline dataset, and rollout horizon statistics over 100 training rollouts (median with interquartile range). Here we vary the uncertainty quantile $\zeta \in \{0.9, 0.99, 0.999, 1.0\}$ for the rollout truncation threshold, without using conservatism.
  • Figure 2: Histogram of estimated reward means $p_0,p_1$ across ensemble members.
  • Figure 3: Average return (normalized by $T$) on test-time bandits with $p^*_1 \in \{0.01, 0.3, 0.55, 0.7, 0.99\}$. Since the observed arm has $p^*_0=0.5$, cases with $p^*_1<0.5$ are worse and those with $p^*_1>0.5$ are better.
  • Figure 4: Empirical CDFs of epistemic uncertainty $U_{\boldsymbol{\theta}}$ over $(s,a)\in\mathrm{supp}_{\mathcal{S}\times \mathcal{A}}(\mathcal{D})$, with logit-scaled y-axis. Uncertainties are normalized by the dataset mean, so $1$ is the average value.
  • Figure 5: Effect of LayerNorm in world models trained and evaluated on halfcheetah-medium-expert-v2. We collect 200 rollouts and truncate only on float32 overflow, without using an uncertainty threshold. For each metric, we plot the median (solid line) together with the 5-95% percentile band across rollouts. The rightmost scatter plot shows the Spearman's rank coefficient in the with-LayerNorm setting; vertical lines mark uncertainty thresholds $\zeta \in\{0.9, 0.99,0.999, 1.0\}$. Full results and plotting setup are shown in \ref{app:LN}.
  • ...and 11 more figures
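
The captions of Figures 1, 4, and 5 refer to a rollout truncation threshold set by an uncertainty quantile $\zeta$ over the dataset's state-action pairs. The following is a minimal sketch of how such a threshold could be computed and applied. It assumes, purely for illustration, that epistemic uncertainty is measured as the standard deviation of ensemble predictions and that the quantile is taken by nearest rank. The function names `disagreement`, `quantile_threshold`, and `rollout_length` are hypothetical and are not taken from the paper.

```python
import math
from statistics import pstdev


def disagreement(preds):
    """Assumed uncertainty proxy: per-dimension std across ensemble
    members' predictions, averaged over dimensions.

    preds: list of K predictions, each a list of floats of equal length.
    """
    dims = len(preds[0])
    per_dim = [pstdev([p[d] for p in preds]) for d in range(dims)]
    return sum(per_dim) / dims


def quantile_threshold(uncertainties, zeta):
    """Empirical zeta-quantile (nearest rank) of dataset uncertainties.

    zeta = 1.0 returns the maximum uncertainty seen on the dataset.
    """
    if not 0.0 < zeta <= 1.0:
        raise ValueError("zeta must lie in (0, 1]")
    ordered = sorted(uncertainties)
    rank = math.ceil(zeta * len(ordered))
    return ordered[rank - 1]


def rollout_length(step_uncertainties, threshold, max_horizon):
    """Number of imagined steps kept before the uncertainty first exceeds
    the threshold (assumption: the offending step itself is discarded)."""
    kept = step_uncertainties[:max_horizon]
    for t, u in enumerate(kept):
        if u > threshold:
            return t
    return len(kept)


# Toy usage with made-up numbers: uncertainties on ten dataset points.
dataset_u = [float(i) for i in range(1, 11)]
threshold = quantile_threshold(dataset_u, 1.0)
steps_kept = rollout_length([0.5, 3.0, 12.0, 1.0], threshold, max_horizon=100)
```

In this sketch a larger $\zeta$ gives a looser threshold and therefore longer rollouts; with $\zeta = 1.0$ a rollout stops only once its uncertainty exceeds everything observed on the dataset.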

Theorems & Definitions (18)

  • Proposition 1
  • proof
  • Definition 1: Model-dependent concentrability
  • Definition 2: Robust concentrability (uehara2021pessimistic)
  • Definition 3: Bayesian concentrability
  • Proposition 2: Bayesian concentrability is upper-bounded by robust concentrability
  • proof
  • Example 1: Strictness of Bayesian concentrability bound
  • Theorem 1
  • proof: Proof sketch
  • ...and 8 more