Learning to Combat Compounding-Error in Model-Based Reinforcement Learning

Chenjun Xiao, Yifan Wu, Chen Ma, Dale Schuurmans, Martin Müller

TL;DR

This work tackles catastrophic compounding errors in model-based RL by learning state-dependent planning horizons. It introduces AdaMVE, which learns the $h$-step cumulative model error via TD updates and uses a softmax-based, state-dependent horizon selector to blend multi-step value estimates. A Wasserstein-distance-based error bound guides horizon selection, and a reference policy stabilizes learning. Empirical results in gridworld and continuous control show that AdaMVE improves sample efficiency and robustness over both model-based and model-free baselines, with extensions such as selective model learning further improving performance when models are learned online.
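The softmax-based horizon selector can be pictured with a small sketch. It assumes the selector weights each horizon $h \in \{0, \dots, H_\text{max}\}$ by a softmax over the negated estimated cumulative model error. The function names, the temperature parameter, and the example numbers are hypothetical, and the paper's exact formulation may differ.

```python
import math


def horizon_weights(errors, temperature=1.0):
    """Softmax over negated error estimates: larger error -> smaller weight.

    errors[h] is an (assumed) estimate of the h-step cumulative model error
    at one state, for h = 0, ..., H_max.
    """
    logits = [-e / temperature for e in errors]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [x / total for x in exps]


def blended_value(value_estimates, errors, temperature=1.0):
    """Weighted mix of the h-step value estimates at one state."""
    weights = horizon_weights(errors, temperature)
    return sum(w * v for w, v in zip(weights, value_estimates))


def average_horizon(errors, temperature=1.0):
    """Expected planning horizon under the softmax weights."""
    weights = horizon_weights(errors, temperature)
    return sum(h * w for h, w in enumerate(weights))


# Hypothetical numbers: the error grows with the horizon, so short
# horizons receive most of the weight.
errors = [0.0, 0.1, 0.3, 0.8, 2.0, 5.0]   # h = 0..5
values = [1.0, 1.2, 1.3, 1.1, 0.4, -2.0]  # h-step value estimates
print(blended_value(values, errors))
print(average_horizon(errors))
```

Under this assumption, a cumulative error that never decreases with $h$ yields weights that never increase with $h$, so the average horizon cannot exceed $H_\text{max}/2$, the value reached when all errors are equal.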

Abstract

Despite its potential to improve sample complexity versus model-free approaches, model-based reinforcement learning can fail catastrophically if the model is inaccurate. An algorithm should ideally be able to trust an imperfect model over a reasonably long planning horizon, and only rely on model-free updates when the model errors get infeasibly large. In this paper, we investigate techniques for choosing the planning horizon on a state-dependent basis, where a state's planning horizon is determined by the maximum cumulative model error around that state. We demonstrate that these state-dependent model errors can be learned with Temporal Difference methods, based on a novel approach of temporally decomposing the cumulative model errors. Experimental results show that the proposed method can successfully adapt the planning horizon to account for state-dependent model accuracy, significantly improving the efficiency of policy learning compared to model-based and model-free baselines.
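The idea of temporally decomposing the cumulative model error can be illustrated with a tabular sketch. It assumes the $h$-step cumulative error obeys a Bellman-like recursion, $E_h(s) = e(s) + \gamma\,\mathbb{E}[E_{h-1}(s')]$ with $E_0 \equiv 0$, where $e(s)$ is a one-step model error observed at $s$. The recursion, the table layout, and all names below are assumptions made for illustration, not the paper's algorithm.

```python
def td_update_cumulative_error(E, s, one_step_error, s_next, alpha=0.1, gamma=1.0):
    """One TD(0) update of every h-step cumulative-error estimate at state s.

    E[h][state] is the current estimate of the h-step cumulative error;
    E[0] is kept at zero (no planning, no model error).
    """
    for h in range(len(E) - 1, 0, -1):
        # Bootstrapped target: error now plus the (h-1)-step error afterwards.
        target = one_step_error + gamma * E[h - 1][s_next]
        E[h][s] += alpha * (target - E[h][s])
    return E


# Hypothetical two-state chain: 0 -> 1 -> 1, with one-step errors 1.0 and 0.5.
H_MAX, N_STATES = 2, 2
E = [[0.0] * N_STATES for _ in range(H_MAX + 1)]
transitions = [(0, 1.0, 1), (1, 0.5, 1)]  # (state, one-step error, next state)

for _ in range(200):
    for s, err, s_next in transitions:
        td_update_cumulative_error(E, s, err, s_next, alpha=0.5)

print(E[1])  # one-step cumulative errors per state
print(E[2])  # two-step cumulative errors per state
```

In this toy chain the estimates approach the sums of one-step errors along each path, which is the quantity a state-dependent horizon selector would consult.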

Paper Structure

This paper contains 18 sections, 1 theorem, 17 equations, 9 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Given any policy $\pi$, an approximate model $\hat{P}$, and a reference value function $\bar{V}$, for planning horizon $H$ we have where $W^\pi(s) = \mathbb{E}_{a\sim \pi(\cdot|s)}\left[W(s,a)\right]$ with $W(s, a) = W( P(\cdot\vert s, a), \hat{P}(\cdot\vert s, a))$ being the Wasserstein distance and $K=\sup_{h} \left\lVert \hat{V}^\pi_{\hat{P},h} \right\rVert_L$ is the maximum Lipschitzness of t
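For intuition about the per-state-action quantity $W(s, a)$, the sketch below computes a 1-Wasserstein distance in the simplest possible setting: one-dimensional next states and two equally sized sample sets, where the distance equals the mean absolute difference between sorted samples. The function name and the sample values are hypothetical, and this is not claimed to be the estimator used in the paper.

```python
def wasserstein_1d(samples_p, samples_q):
    """1-Wasserstein distance between two 1-D empirical distributions.

    Requires the same number of samples on both sides; in that case the
    optimal coupling matches the i-th smallest sample of one set with the
    i-th smallest sample of the other.
    """
    if len(samples_p) != len(samples_q):
        raise ValueError("sample sets must have the same size")
    pairs = zip(sorted(samples_p), sorted(samples_q))
    return sum(abs(p - q) for p, q in pairs) / len(samples_p)


# Hypothetical next-state samples from the true dynamics P(.|s, a)
# and from a learned model P_hat(.|s, a) that is shifted by one unit.
true_next = [0.0, 1.0, 2.0]
model_next = [3.0, 1.0, 2.0]
print(wasserstein_1d(true_next, model_next))
```

A model whose predicted next states match the true ones in distribution gives a distance of zero, so states where this quantity is small are the ones where longer planning horizons are easier to justify.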

Figures (9)

  • Figure 1: Illustration of the adaptive planning horizon in FourRoom with an imperfect model. The model is perfect in three rooms and totally wrong in the bottom-left room. Non-adaptive MVE diverges due to the large model errors (right). In contrast, AdaMVE adapts the planning horizon to each state (see (a); darker color means a longer planning horizon), outperforming both the model-based and model-free baselines.
  • Figure 2: FourRoom Env
  • Figure 3: Visualization of the learned planning horizon on FourRoom. For each state, the average horizon $\bar{H}(s)$ weighted by \ref{eq:h-policy} is presented. We use $H_\text{max}=5$, thus $\bar{H}(s) \le H_\text{max}/2 = 2.5$ by definition. Our method can successfully adapt the planning horizon when the model is imperfect.
  • Figure 4: Policy learning performance with different models. The shaded area shows the standard error. Results clearly show that AdaMVE significantly outperforms MVE when the model is imperfect (no wall model and 3room model).
  • Figure 5: PointMass Navigation example environments.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Theorem 1
  • proof