Bayes Adaptive Monte Carlo Tree Search for Offline Model-based Reinforcement Learning

Jiayu Chen, Le Xu, Wentse Chen, Jeff Schneider

TL;DR

This paper casts offline model-based reinforcement learning as a Bayes Adaptive MDP (BAMDP) to handle uncertainty over learned world models. It introduces Continuous BAMCP, a Bayes Adaptive Monte Carlo planning algorithm built on Monte Carlo Tree Search that plans in BAMDPs with continuous state and action spaces and stochastic transitions, and embeds it in policy iteration as the policy improvement step, yielding an "RL + Search" approach. In experiments on twelve D4RL MuJoCo tasks and three tokamak target-tracking tasks, the authors report that deep ensembles for model uncertainty, combined with Bayes-adaptive planning, substantially improve policy performance, with reward penalties used to limit exploitation of model errors. The result combines Bayesian reasoning, tree search, and offline data to produce policies that can be deployed in real time.

Abstract

Offline RL is a powerful approach for data-driven decision-making and control. Compared to model-free methods, offline model-based RL (MBRL) explicitly learns world models from a static dataset and uses them as surrogate simulators, improving data efficiency and enabling the learned policy to potentially generalize beyond the dataset support. However, many different MDPs can behave identically on the offline dataset, and dealing with the uncertainty about the true MDP can be challenging. In this paper, we propose modeling offline MBRL as a Bayes Adaptive Markov Decision Process (BAMDP), which is a principled framework for addressing model uncertainty. We further propose a novel Bayes Adaptive Monte-Carlo planning algorithm capable of solving BAMDPs in continuous state and action spaces with stochastic transitions. This planning process is based on Monte Carlo Tree Search and can be integrated into offline MBRL as a policy improvement operator in policy iteration. Our "RL + Search" framework follows in the footsteps of superhuman AIs like AlphaZero, improving on current offline MBRL methods by investing more computation in planning. The proposed algorithm significantly outperforms state-of-the-art offline RL methods on twelve D4RL MuJoCo tasks and three target tracking tasks in a challenging, stochastic tokamak control simulator. The codebase is available at: https://github.com/LucasCJYSDL/Offline-RL-Kit.

Paper Structure

This paper contains 22 sections, 1 theorem, 7 equations, 4 figures, 11 tables, 1 algorithm.

Key Result

Lemma 1.1

For all suffix histories $h'$ of $h$, $b(\theta|h') = \tilde{b}(\theta|h')$. Here, $b(\theta|h')$ is the true posterior probability of $\theta$ at the decision point $h'$, while $\tilde{b}(\theta|h')$ is the probability of experiencing $\theta$ at $h'$ when using root sampling.
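The lemma can be checked numerically on a toy problem. The sketch below is not the paper's setting: it assumes two hypothetical models (`theta_a`, `theta_b`) that each emit a Bernoulli outcome, an empty root history, and illustrative prior and outcome probabilities. It compares the exact Bayes posterior $b(\theta|h')$ with the empirical frequency of each $\theta$ among root-sampled simulations that pass through $h'$.

```python
import random
from collections import Counter

# Hypothetical toy setup (not from the paper): two candidate models, each a
# Bernoulli outcome distribution, with a prior belief held at the root.
PRIOR = {"theta_a": 0.3, "theta_b": 0.7}
P_ONE = {"theta_a": 0.9, "theta_b": 0.4}  # P(outcome = 1 | theta)


def exact_posterior(history):
    """Bayes rule: b(theta | h') is proportional to b(theta) * P(h' | theta)."""
    weights = {}
    for theta, prior in PRIOR.items():
        likelihood = 1.0
        for outcome in history:
            likelihood *= P_ONE[theta] if outcome == 1 else 1.0 - P_ONE[theta]
        weights[theta] = prior * likelihood
    total = sum(weights.values())
    return {theta: w / total for theta, w in weights.items()}


def root_sampling_estimate(history, num_simulations, seed=0):
    """Draw theta once per simulation at the root, roll out with that theta,
    and count which theta was in use whenever the rollout equals `history`."""
    rng = random.Random(seed)
    thetas = list(PRIOR)
    prior_weights = [PRIOR[t] for t in thetas]
    target = tuple(history)
    counts = Counter()
    for _ in range(num_simulations):
        theta = rng.choices(thetas, weights=prior_weights)[0]
        rollout = tuple(1 if rng.random() < P_ONE[theta] else 0 for _ in target)
        if rollout == target:
            counts[theta] += 1
    total = sum(counts.values())
    return {theta: counts[theta] / total for theta in thetas}


if __name__ == "__main__":
    h_prime = (1, 1, 0)
    print("exact posterior:    ", exact_posterior(h_prime))
    print("root-sampling freq.:", root_sampling_estimate(h_prime, 200_000))
```

With enough simulations the two dictionaries should agree up to Monte Carlo noise, which is the practical content of the lemma: the planner does not need to resample $\theta$ from the posterior at every node.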

Figures (4)

  • Figure 1: Performance of Sampled EfficientZero on D4RL MuJoCo tasks. The results for HalfCheetah, Hopper, and Walker2d are presented in the three rows, respectively. Each subfigure depicts the change in undiscounted episodic return as a function of the number of training samples. Experiments are repeated three times with different random seeds, with the solid line representing the mean and the shaded area indicating the 95% confidence interval. For reference, the expert-level episodic returns for HalfCheetah, Hopper, and Walker2d are 12135, 3234.3, and 4592.3, respectively.
  • Figure 2: Evaluation results for the tokamak control tasks. The figure shows the change in episodic returns over training epochs for the proposed algorithms and baselines across three target tracking tasks in the nuclear fusion scenario. Solid lines represent the average performance, while shaded areas indicate the 95% confidence intervals.
  • Figure 3: Belief adaptation during offline and imaginary rollouts. (a) shows the belief over twelve ensemble members, each represented by a specific color, adapting to an offline trajectory of Hopper-med-expert. (b), (c), and (d) illustrate the belief changes during imaginary rollouts which start from the beginning, middle, and end of the offline trajectory shown in (a), respectively.
  • Figure 4: Performance of our proposed algorithms on D4RL MuJoCo tasks. The results for HalfCheetah, Hopper, and Walker2d are presented in the three rows, respectively. Each subfigure depicts the change in the undiscounted episodic return as a function of training epochs. Experiments are repeated three times with different random seeds, with the solid line representing the mean and the shaded area indicating the 95% confidence interval. For reference, the expert-level episodic returns for HalfCheetah, Hopper, and Walker2d are 12135, 3234.3, and 4592.3, respectively. Note that the training epochs for each algorithm, as listed in Table 3, have been linearly scaled to 800 for better visualization.
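
The belief adaptation shown in Figure 3 can be illustrated with a generic Bayes update over ensemble members. The sketch below is an assumption-laden toy, not the authors' implementation: it uses a hypothetical one-dimensional ensemble in which member `i` predicts the next state as a Gaussian with mean `s + GAINS[i] * a` and a shared standard deviation `STD`, and it reweights a uniform prior by each member's likelihood of the observed transitions.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2


def update_belief(belief, log_likelihoods):
    """One Bayes step: new_belief[i] is proportional to belief[i] * likelihood[i].

    Computed in log space with a max-shift so small likelihoods do not
    underflow. Members whose belief has reached zero stay at zero.
    """
    log_post = [
        (math.log(b) if b > 0.0 else -math.inf) + ll
        for b, ll in zip(belief, log_likelihoods)
    ]
    shift = max(log_post)
    unnormalized = [math.exp(lp - shift) for lp in log_post]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical ensemble (illustrative values only).
GAINS = [0.5, 1.0, 1.5]
STD = 0.1


def belief_along_trajectory(transitions, prior=None):
    """Return the belief before any data and after each (s, a, s_next) tuple."""
    if prior is None:
        belief = [1.0 / len(GAINS)] * len(GAINS)
    else:
        belief = list(prior)
    beliefs = [belief]
    for s, a, s_next in transitions:
        log_liks = [gaussian_log_pdf(s_next, s + g * a, STD) for g in GAINS]
        belief = update_belief(belief, log_liks)
        beliefs.append(belief)
    return beliefs


if __name__ == "__main__":
    # Transitions that roughly follow the member with gain 1.0.
    trajectory = [(0.0, 1.0, 1.02), (1.02, 1.0, 1.98), (1.98, -0.5, 1.5)]
    for step, b in enumerate(belief_along_trajectory(trajectory)):
        print(step, [round(p, 4) for p in b])
```

In this toy, the belief concentrates on the member whose predictions best explain the observed transitions. The same update applies whether the transitions come from the offline dataset or from imagined rollouts.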

Theorems & Definitions (2)

  • Lemma 1.1
  • proof