Mitigating Distribution Shift in Model-based Offline RL via Shifts-aware Reward Learning
Wang Luo, Haoran Li, Zicheng Zhang, Congying Han, Chi Zhou, Jiayu Lv, Tiande Guo
TL;DR
This work addresses distribution shift in model-based offline RL by separating it into model bias and policy shift and deriving a shifts-aware reward (SAR) through a unified probabilistic-inference framework. Building on SAR, the authors propose SAMBO-RL, which learns a transition classifier and an action classifier to approximate the shift-adjusted reward and guides policy optimization with short-horizon model-based rollouts. Theoretical guarantees are provided via a $\xi$-uncertainty quantifier and a PEVI-style bound, and empirical results on D4RL and NeoRL show that SAMBO-RL delivers competitive or superior performance while mitigating distribution shift. The approach reduces reliance on value ensembles and improves training stability, though classifier accuracy remains a practical limitation; future work could address distribution shift without explicit classifiers.
Abstract
Model-based offline reinforcement learning trains policies using pre-collected datasets and learned environment models, eliminating the need for direct real-world environment interaction. However, this paradigm is inherently challenged by distribution shift (DS). Existing methods address this issue by leveraging off-policy mechanisms and estimating model uncertainty, but they often result in inconsistent objectives and lack a unified theoretical foundation. This paper offers a comprehensive analysis that disentangles the problem into two fundamental components: model bias and policy shift. Our theoretical and empirical investigations reveal how these factors distort value estimation and restrict policy optimization. To tackle these challenges, we derive a novel shifts-aware reward through a unified probabilistic inference framework, which modifies the vanilla reward to refine value learning and facilitate policy training. Building on this, we develop a practical implementation that leverages classifier-based techniques to approximate the adjusted reward for effective policy optimization. Empirical results across multiple benchmarks demonstrate that the proposed approach mitigates distribution shift and achieves superior or comparable performance, validating our theoretical insights.
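The abstract refers to classifier-based techniques for approximating the adjusted reward without giving the construction. As a rough illustration only, the sketch below shows the generic density-ratio trick: a binary classifier trained to separate samples of two distributions p and q (with balanced classes) has a logit that estimates log p(x)/q(x). Whether the paper's transition and action classifiers take exactly this form is an assumption here; the 1-D Gaussian data, the helper names (`fit_logistic`, `log_ratio`), and the hyperparameters are all hypothetical and are not taken from the paper.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, lr=0.5, steps=300):
    """Full-batch gradient descent on the mean logistic loss (1-D linear model)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of the loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def log_ratio(x, w, b):
    """Classifier logit; estimates log p(x)/q(x) when the two classes are balanced."""
    return w * x + b


# Hypothetical toy data: p = N(1, 1) gets label 1, q = N(0, 1) gets label 0.
rng = random.Random(0)
n = 1000
p_samples = [rng.gauss(1.0, 1.0) for _ in range(n)]
q_samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
xs = p_samples + q_samples
ys = [1.0] * n + [0.0] * n

w, b = fit_logistic(xs, ys)

# For these two Gaussians the exact log-ratio is x - 0.5,
# so w should land near 1 and b near -0.5.
print(f"w = {w:.3f}, b = {b:.3f}")
```

In this toy setting the true log-ratio is linear in x, so a linear classifier is well specified; for transitions or actions in an RL problem, a neural classifier would typically replace the linear model, and its accuracy would bound the quality of the estimated ratio.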
