Mitigating Distribution Shift in Model-based Offline RL via Shifts-aware Reward Learning

Wang Luo, Haoran Li, Zicheng Zhang, Congying Han, Chi Zhou, Jiayu Lv, Tiande Guo

TL;DR

This work addresses distribution shift in model-based offline RL by separating it into model bias and policy shift and deriving a shifts-aware reward (SAR) through a unified probabilistic-inference framework. Building on SAR, the authors propose SAMBO-RL, which learns a transition classifier and an action classifier to approximate the shift-adjusted reward and guides policy optimization with short-horizon model-based rollouts. Theoretical guarantees are provided via a $\xi$-uncertainty quantifier and a PEVI-style bound, and empirical results on D4RL and NeoRL show that SAMBO-RL delivers competitive or superior performance while mitigating distribution shift. The approach reduces reliance on value ensembles and improves training stability, though classifier accuracy remains a practical limitation, and future work could address distribution shift without explicit classifiers.

Abstract

Model-based offline reinforcement learning trains policies using pre-collected datasets and learned environment models, eliminating the need for direct real-world environment interaction. However, this paradigm is inherently challenged by distribution shift (DS). Existing methods address this issue by leveraging off-policy mechanisms and estimating model uncertainty, but they often result in inconsistent objectives and lack a unified theoretical foundation. This paper offers a comprehensive analysis that disentangles the problem into two fundamental components: model bias and policy shift. Our theoretical and empirical investigations reveal how these factors distort value estimation and restrict policy optimization. To tackle these challenges, we derive a novel shifts-aware reward through a unified probabilistic inference framework, which modifies the vanilla reward to refine value learning and facilitate policy training. Building on this, we develop a practical implementation that leverages classifier-based techniques to approximate the adjusted reward for effective policy optimization. Empirical results across multiple benchmarks demonstrate that the proposed approach mitigates distribution shift and achieves superior or comparable performance, validating our theoretical insights.
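The abstract does not spell out how a classifier can stand in for a reward correction. One standard route, shown here only as a generic illustration and not as the authors' implementation, is the density-ratio trick: a binary classifier trained on balanced samples from two distributions has a logit that approximates their log density ratio. The toy Gaussians, the function names, and the learning-rate settings below are all assumptions made for this sketch.

```python
import numpy as np

def fit_logistic(x, y, lr=0.5, steps=2000):
    """Fit logit(x) = w * x + b by gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))  # predicted P(label = 1 | x)
        grad = p - y                            # gradient of log-loss w.r.t. the logit
        w -= lr * np.mean(grad * x)
        b -= lr * np.mean(grad)
    return w, b

def log_ratio(x, w, b):
    """With balanced classes, the logit estimates log p(x) / q(x)."""
    return w * x + b

rng = np.random.default_rng(0)
n = 5000
x_p = rng.normal(0.0, 1.0, n)  # label 1: stand-in for "dataset" samples (hypothetical)
x_q = rng.normal(1.0, 1.0, n)  # label 0: stand-in for "model" samples (hypothetical)
x = np.concatenate([x_p, x_q])
y = np.concatenate([np.ones(n), np.zeros(n)])

w, b = fit_logistic(x, y)
# For these two unit-variance Gaussians the exact log ratio is -x + 0.5,
# so the fitted (w, b) should land near (-1, 0.5).
print(w, b, log_ratio(0.0, w, b))
```

In a setting like the one the abstract describes, the inputs would presumably be transitions or state-action pairs rather than scalars, and the classifier a neural network; those details are not given here.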

Paper Structure (34 sections, 47 equations, 15 figures, 5 tables, 3 algorithms)

Figures (15)

  • Figure 1: The negative impacts of model bias and policy shift. (a) In the 1D grid world, the agent can take two actions: move left (L) or move right (R) from the current state $S$ to the adjacent grid cells $S-1$ or $S+1$, with movement obstructed at the boundaries. The initial state is uniformly distributed, and the agent receives undiscounted rewards upon reaching specific states. (b) The left panel illustrates value estimates during training and the final policy convergence with and without model bias. Even slight model bias can significantly distort value function estimates, leading the learned policies to deviate from the true optimal policy in the actual environment. (c) The right panel showcases expected returns and the KL divergence between the learned policy and the behavior policy with and without policy shift. We find that policy shift implicitly slows policy convergence.
  • Figure 2: Effectiveness of the shifts-aware reward. (a) SAR is capable of achieving a near-optimal return. (b) and (c) The $(s, a, s^{\prime})$ distributions induced by offline RL algorithms on the Hopper-M task in the NeoRL benchmark. Model Data represents the distribution of data generated by executing the final learned policy in the dynamics model, while Real Data represents the distribution of data generated by executing the final learned policy in the real environment.
  • Figure 3: Performance comparison on Hopper tasks in the NeoRL benchmark.
  • Figure 4: Illustration of ablation experiments in the NeoRL benchmark.
  • Figure 5: The tuning experiment was performed on halfcheetah-random datasets in the D4RL benchmark.
  • ...and 10 more figures
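
The 1D grid world of Figure 1(a) is simple enough to sketch. The caption only states that the agent moves left or right, is blocked at the boundaries, and receives undiscounted rewards at specific states; the grid size, reward values, terminal states, small step cost (used to break ties), and the particular model error below are all hypothetical choices for illustration, not the paper's setup. The sketch plans once with the true dynamics and once with a learned model that gets a single transition wrong, then evaluates both policies in the true environment.

```python
import numpy as np

N_STATES = 5
LEFT, RIGHT = 0, 1
REWARD = {0: 1.0, 4: 2.0}  # hypothetical rewards; reaching these states ends the episode
STEP_COST = 0.01           # hypothetical tie-breaker favoring shorter paths
HORIZON = 10

def true_step(s, a):
    """True dynamics: move one cell, blocked at the boundaries."""
    return max(s - 1, 0) if a == LEFT else min(s + 1, N_STATES - 1)

def biased_step(s, a):
    """Hypothetical biased model: wrongly predicts that RIGHT from state 3 stays put."""
    if s == 3 and a == RIGHT:
        return 3
    return true_step(s, a)

def plan(step, horizon=HORIZON):
    """Finite-horizon, undiscounted value iteration under the given dynamics."""
    v = np.zeros(N_STATES)
    policy = np.zeros(N_STATES, dtype=int)
    for _ in range(horizon):
        new_v = np.zeros(N_STATES)
        for s in range(N_STATES):
            if s in REWARD:  # terminal state, value stays 0
                continue
            q = []
            for a in (LEFT, RIGHT):
                s2 = step(s, a)
                q.append(REWARD.get(s2, 0.0) - STEP_COST + v[s2])
            policy[s] = int(np.argmax(q))
            new_v[s] = max(q)
        v = new_v
    return policy, v

def rollout(policy, start, max_steps=HORIZON):
    """Return collected by following `policy` in the TRUE environment."""
    s, total = start, 0.0
    for _ in range(max_steps):
        if s in REWARD:
            break
        s = true_step(s, policy[s])
        total += REWARD.get(s, 0.0) - STEP_COST
    return total

pi_true, _ = plan(true_step)
pi_model, _ = plan(biased_step)
print("true-dynamics policy (states 1-3):", pi_true[1:4])
print("biased-model policy  (states 1-3):", pi_model[1:4])
print("returns from state 2:", rollout(pi_true, 2), rollout(pi_model, 2))
```

Under these assumptions the biased model never sees a path to the higher reward, so its greedy policy heads left everywhere and earns roughly half the return of the policy planned on the true dynamics.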

Theorems & Definitions (2)

  • Definition 1
  • Definition 2