Bayes Adaptive Monte Carlo Tree Search for Offline Model-based Reinforcement Learning
Jiayu Chen, Le Xu, Wentse Chen, Jeff Schneider
TL;DR
This paper reframes offline model-based reinforcement learning as a Bayes Adaptive MDP (BAMDP) to address uncertainty over learned world models. It introduces Continuous BAMCP, a Bayes Adaptive Monte-Carlo planning algorithm based on Monte Carlo Tree Search, for planning in continuous, stochastic BAMDPs, and embeds it into a policy-iteration framework, yielding an "RL + Search" approach. In experiments on D4RL MuJoCo benchmarks and tokamak-control tasks, the authors show that deep ensembles for model uncertainty, together with Bayes-adaptive planning, substantially improve policy performance, with reward penalties mitigating model leakage. The work advances data-efficient, robust offline control by combining Bayesian reasoning, deep search, and offline data to produce real-time deployable policies.
Abstract
Offline RL is a powerful approach for data-driven decision-making and control. Compared to model-free methods, offline model-based RL (MBRL) explicitly learns world models from a static dataset and uses them as surrogate simulators, improving data efficiency and enabling the learned policy to potentially generalize beyond the dataset support. However, there can be many MDPs that behave identically on the offline dataset, and dealing with the uncertainty about the true MDP can be challenging. In this paper, we propose modeling offline MBRL as a Bayes Adaptive Markov Decision Process (BAMDP), which is a principled framework for addressing model uncertainty. We further propose a novel Bayes Adaptive Monte-Carlo planning algorithm capable of solving BAMDPs in continuous state and action spaces with stochastic transitions. This planning process is based on Monte Carlo Tree Search and can be integrated into offline MBRL as a policy improvement operator in policy iteration. Our "RL + Search" framework follows in the footsteps of superhuman AIs like AlphaZero, improving on current offline MBRL methods by incorporating more computation. The proposed algorithm significantly outperforms state-of-the-art offline RL methods on twelve D4RL MuJoCo tasks and three target tracking tasks in a challenging, stochastic tokamak control simulator. The codebase is available at: https://github.com/LucasCJYSDL/Offline-RL-Kit.
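To make the Bayes-adaptive idea concrete, the sketch below shows one way a belief over candidate world models could be updated after observing a transition. It is an illustration only, not the authors' implementation: it assumes a finite ensemble whose members each predict a one-dimensional Gaussian next state, and the names `gaussian_log_likelihood` and `update_belief` are hypothetical.

```python
import math


def gaussian_log_likelihood(x, mean, std):
    """Log density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2


def update_belief(belief, predictions, observed_next_state):
    """Apply Bayes' rule over ensemble members: posterior is proportional to prior * likelihood.

    belief: prior probabilities, one per ensemble member (assumed strictly positive).
    predictions: (mean, std) pairs, each member's predicted next-state distribution.
    observed_next_state: the next state that was actually observed (a float here).
    """
    log_posterior = [
        math.log(prior) + gaussian_log_likelihood(observed_next_state, mean, std)
        for prior, (mean, std) in zip(belief, predictions)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    max_log = max(log_posterior)
    weights = [math.exp(lp - max_log) for lp in log_posterior]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical example: three ensemble members with a uniform prior.
belief = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
predictions = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
posterior = update_belief(belief, predictions, observed_next_state=1.0)
```

In this toy setting the member whose prediction is centered on the observed next state gains probability mass, while the other two lose it equally. In a BAMDP, a belief of this kind is part of the planner's augmented state, so a search procedure can account for how future observations would change which models remain plausible.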
