Combating the Compounding-Error Problem with a Multi-step Model
Kavosh Asadi, Dipendra Misra, Seungchan Kim, Michael L. Littman
TL;DR
The paper tackles the compounding-error problem in model-based RL by introducing M^3, a multi-step transition model that directly predicts h-step outcomes and always rolls out from the fixed starting state, so noisy intermediate predictions are never fed back into the model. It provides theoretical value-function and generalization bounds showing reduced dependence on the horizon, and demonstrates empirically that M^3 improves both background and decision-time planning across multiple domains, reducing planning errors and improving sample efficiency. The work highlights computational considerations, discusses extensions to stochastic dynamics and ensembles, and outlines future directions for applying multi-step modeling to more complex domains. Collectively, the study argues that multi-step models offer a principled and practical path to more reliable model-based RL.
Abstract
Model-based reinforcement learning is an appealing framework for creating agents that learn, plan, and act in sequential environments. Model-based algorithms typically involve learning a transition model that takes a state and an action and outputs the next state (a one-step model). This model can be composed with itself to enable predicting multiple steps into the future, but one-step prediction errors can get magnified, leading to unacceptable inaccuracy. This compounding-error problem plagues planning and undermines model-based reinforcement learning. In this paper, we address the compounding-error problem by introducing a multi-step model that directly outputs the outcome of executing a sequence of actions. Novel theoretical and empirical results indicate that the multi-step model is more conducive to efficient value-function estimation, and it yields better action selection compared to the one-step model. These results make a strong case for using multi-step models in the context of model-based reinforcement learning.
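
The compounding-error problem described above can be illustrated with a minimal sketch. This is a toy example, not the paper's M^3 implementation: the 1-D linear dynamics, the slope values 1.1 and 1.15, and the function names `true_step`, `one_step_model`, `rollout`, and `one_step_errors` are all assumptions chosen for illustration. It composes a slightly inaccurate one-step model with itself and records how far its predictions drift from the true trajectory at each horizon.

```python
# Toy illustration only: hypothetical 1-D deterministic dynamics,
# not the environments or models used in the paper.

def true_step(s, a):
    """Assumed ground-truth dynamics: next state from state s and action a."""
    return 1.1 * s + a


def one_step_model(s, a):
    """Hypothetical learned one-step model with a small error in the slope."""
    return 1.15 * s + a


def rollout(step_fn, s0, actions):
    """Compose a one-step function with itself along an action sequence."""
    s = s0
    states = []
    for a in actions:
        s = step_fn(s, a)  # each prediction is fed back in as the next input
        states.append(s)
    return states


def one_step_errors(s0, actions):
    """Absolute prediction error of the composed one-step model per horizon."""
    truth = rollout(true_step, s0, actions)
    predicted = rollout(one_step_model, s0, actions)
    return [abs(p - t) for p, t in zip(predicted, truth)]


if __name__ == "__main__":
    errors = one_step_errors(1.0, [0.0] * 5)
    for h, e in enumerate(errors, start=1):
        print(f"horizon {h}: error {e:.4f}")
```

In this toy setting the error grows with every additional step, because each prediction becomes the input to the next one. A multi-step model, as described in the abstract, would instead map the starting state and the whole action sequence directly to the outcome, so intermediate predictions are never reused as inputs.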
