Any-step Dynamics Model Improves Future Predictions for Online and Offline Reinforcement Learning
Haoxin Lin, Yu-Yan Xu, Yihao Sun, Zhilong Zhang, Yi-Chen Li, Chengxing Jia, Junyin Ye, Jiaji Zhang, Yang Yu
TL;DR
This paper tackles the compounding error caused by bootstrapped prediction in model-based RL by introducing the Any-step Dynamics Model (ADM), which predicts future states directly from variable-length backtracked plans. Two ADM-based algorithms are proposed: ADMPO-ON for the online setting and ADMPO-OFF for the offline setting. Both use ADM to improve future state predictions and to quantify model uncertainty without ensembles. Empirically, ADM reduces compounding error, improves sample efficiency online (MuJoCo), and yields stronger offline performance on D4RL and NeoRL, with an uncertainty estimate that closely tracks the actual model error. The results suggest that ADM can improve data efficiency and reliability in both online and offline reinforcement learning.
Abstract
Model-based methods in reinforcement learning offer a promising approach to enhance data efficiency by facilitating policy exploration within a dynamics model. However, accurately predicting sequential steps in the dynamics model remains a challenge due to bootstrapping prediction, in which each next state is predicted from the model's own prediction of the current state. This leads to accumulated errors during model rollout. In this paper, we propose the Any-step Dynamics Model (ADM) to mitigate the compounding error by reducing bootstrapping prediction to direct prediction. ADM allows variable-length plans to be used as inputs for predicting future states without frequent bootstrapping. We design two algorithms, ADMPO-ON and ADMPO-OFF, which apply ADM in online and offline model-based frameworks, respectively. In the online setting, ADMPO-ON demonstrates improved sample efficiency compared to previous state-of-the-art methods. In the offline setting, ADMPO-OFF not only demonstrates superior performance compared to recent state-of-the-art offline approaches but also offers better quantification of model uncertainty using only a single ADM.
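
The difference between bootstrapping prediction and direct prediction can be illustrated with a toy sketch. This is not the paper's implementation: the scalar dynamics, the constant per-call model bias, and all function names (`true_step`, `one_step_model`, `bootstrapped_rollout`, `any_step_model`) are hypothetical, and the sketch assumes that a direct any-step predictor incurs the same error per call as the one-step model.

```python
# Toy illustration only: scalar dynamics s' = s + a, and "learned" models
# that are exact up to a constant bias per call (an assumption of this sketch).

def true_step(state, action):
    """Ground-truth dynamics of the toy environment."""
    return state + action

def true_rollout(state, actions):
    """Apply the true dynamics along a sequence of actions."""
    for action in actions:
        state = true_step(state, action)
    return state

def one_step_model(state, action, bias=0.01):
    """Hypothetical one-step model: correct up to a fixed bias per call."""
    return true_step(state, action) + bias

def bootstrapped_rollout(state, actions, bias=0.01):
    """Feed each prediction back in as the next input, so the bias accumulates."""
    for action in actions:
        state = one_step_model(state, action, bias)
    return state

def any_step_model(state, actions, bias=0.01):
    """Hypothetical direct predictor: one call from a real state plus a plan
    (action sequence) of any length, so the bias is incurred only once."""
    return state + sum(actions) + bias

if __name__ == "__main__":
    start, plan = 0.0, [0.5] * 10
    truth = true_rollout(start, plan)
    print("bootstrapped error:", abs(bootstrapped_rollout(start, plan) - truth))
    print("direct error:      ", abs(any_step_model(start, plan) - truth))
```

Under these assumptions the bootstrapped error grows linearly with the rollout length (ten calls, ten biases), while the direct prediction pays the bias once. Real learned models do not have a constant bias, so this only conveys the intuition behind replacing repeated bootstrapping with direct prediction.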
