Learning to Combat Compounding-Error in Model-Based Reinforcement Learning
Chenjun Xiao, Yifan Wu, Chen Ma, Dale Schuurmans, Martin Müller
TL;DR
This work addresses catastrophic compounding errors in model-based RL by learning state-dependent planning horizons. It introduces AdaMVE, which learns an $h$-step cumulative model error via TD updates and uses a softmax-based, state-dependent horizon selector to blend multi-step value estimates. A Wasserstein-distance-based error bound guides horizon selection, and a reference policy stabilizes learning. Empirical results in gridworld and continuous control show that AdaMVE improves sample efficiency and robustness over both model-based and model-free baselines, and that extensions such as selective model learning further improve performance when the model is learned online.
Abstract
Despite its potential to improve sample complexity versus model-free approaches, model-based reinforcement learning can fail catastrophically if the model is inaccurate. An algorithm should ideally be able to trust an imperfect model over a reasonably long planning horizon, and only rely on model-free updates when the model errors get infeasibly large. In this paper, we investigate techniques for choosing the planning horizon on a state-dependent basis, where a state's planning horizon is determined by the maximum cumulative model error around that state. We demonstrate that these state-dependent model errors can be learned with Temporal Difference methods, based on a novel approach of temporally decomposing the cumulative model errors. Experimental results show that the proposed method can successfully adapt the planning horizon to account for state-dependent model accuracy, significantly improving the efficiency of policy learning compared to model-based and model-free baselines.
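To make the two ideas in the abstract concrete, the following is a minimal tabular sketch, not the authors' implementation. It assumes that the $h$-step cumulative error decomposes temporally as a one-step model error plus the discounted $(h-1)$-step error at the next state, and that horizons are weighted by a softmax over negative error estimates. The function names, the table layout `err_table[h][s]`, and the parameters `alpha`, `gamma`, and `temperature` are all hypothetical choices made for illustration.

```python
import math


def td_update_error(err_table, s, s_next, one_step_error, h, alpha=0.1, gamma=0.99):
    """One TD update of the h-step cumulative model error at state s.

    Assumed decomposition (illustrative, not taken from the paper):
        e_h(s) ~ one_step_error + gamma * e_{h-1}(s_next),  with e_0 = 0.
    err_table[h][s] holds the current estimate; row 0 must stay all zeros.
    """
    target = one_step_error + gamma * err_table[h - 1][s_next]
    err_table[h][s] += alpha * (target - err_table[h][s])
    return err_table[h][s]


def horizon_weights(errors, temperature=1.0):
    """Softmax over negative error estimates: lower error gives higher weight."""
    logits = [-e / temperature for e in errors]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def blended_value(h_step_values, errors, temperature=1.0):
    """Blend per-horizon value estimates using the state's horizon weights."""
    weights = horizon_weights(errors, temperature)
    return sum(w * v for w, v in zip(weights, h_step_values))


if __name__ == "__main__":
    # Two states, horizons 0..2; row 0 is the fixed zero-error base case.
    err_table = [[0.0, 0.0] for _ in range(3)]
    td_update_error(err_table, s=0, s_next=1, one_step_error=1.0, h=1)
    # A horizon with a large estimated error contributes little to the blend.
    print(blended_value([1.0, 3.0], errors=[0.1, 5.0]))
```

In this sketch a small `temperature` approaches a hard choice of the lowest-error horizon, while a large one approaches a uniform average over horizons; how the actual method sets this trade-off is not specified in the text above.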
