Any-step Dynamics Model Improves Future Predictions for Online and Offline Reinforcement Learning

Haoxin Lin; Yu-Yan Xu; Yihao Sun; Zhilong Zhang; Yi-Chen Li; Chengxing Jia; Junyin Ye; Jiaji Zhang; Yang Yu

TL;DR

This paper tackles the bootstrapping-induced compounding error in model-based RL by introducing Any-step Dynamics Model (ADM), which supports direct prediction from variable-length backtracked plans. Two ADM-based algorithms are proposed: ADMPO-ON for online settings and ADMPO-OFF for offline settings, both leveraging ADM to improve future state predictions and to quantify model uncertainty without ensembles. Empirical results show that ADM reduces compounding error and yields superior sample efficiency online (MuJoCo) and stronger offline performance on D4RL and NeoRL, with credible uncertainty quantification that tracks actual model error closely. The work demonstrates ADM’s potential to enhance data efficiency and reliability in both online and offline reinforcement learning, with practical implications for safer and more robust deployment.

Abstract

Model-based methods in reinforcement learning offer a promising approach to enhance data efficiency by facilitating policy exploration within a dynamics model. However, accurately predicting sequential steps in the dynamics model remains a challenge due to the bootstrapping prediction, which attributes the next state to the prediction of the current state. This leads to accumulated errors during model roll-out. In this paper, we propose the Any-step Dynamics Model (ADM) to mitigate the compounding error by reducing bootstrapping prediction to direct prediction. ADM allows for the use of variable-length plans as inputs for predicting future states without frequent bootstrapping. We design two algorithms, ADMPO-ON and ADMPO-OFF, which apply ADM in online and offline model-based frameworks, respectively. In the online setting, ADMPO-ON demonstrates improved sample efficiency compared to previous state-of-the-art methods. In the offline setting, ADMPO-OFF not only demonstrates superior performance compared to recent state-of-the-art offline approaches but also offers better quantification of model uncertainty using only a single ADM.
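The difference between bootstrapping and direct prediction can be made concrete with a small interface sketch. The block below is only an illustration under stated assumptions, not the authors' implementation: `ToyAnyStepModel` is a hypothetical, untrained RNN-style stand-in with random weights, `backtracked_predictions` and `disagreement` are hypothetical helpers, and the indexing of the backtracking length `k` is chosen for convenience. It shows how a single model could consume a start state plus a variable-length plan of actions, and how the spread across backtracking lengths could serve as one plausible uncertainty signal from a single model.

```python
import math
import random


class ToyAnyStepModel:
    """Untrained RNN-style stand-in for an any-step dynamics model.

    All sizes and weights are hypothetical; a real model would be trained.
    """

    def __init__(self, state_dim, action_dim, hidden_dim=8, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.w_init = mat(hidden_dim, state_dim)   # start state -> hidden
        self.w_act = mat(hidden_dim, action_dim)   # action -> hidden update
        self.w_rec = mat(hidden_dim, hidden_dim)   # hidden -> hidden
        self.w_out = mat(state_dim, hidden_dim)    # hidden -> predicted state

    @staticmethod
    def _matvec(m, v):
        return [sum(w * x for w, x in zip(row, v)) for row in m]

    def predict(self, start_state, plan):
        """Predict the state reached after running `plan` from `start_state`.

        `plan` is a non-empty list of actions of any length. Intermediate
        states are never predicted and fed back in.
        """
        if not plan:
            raise ValueError("plan must contain at least one action")
        h = [math.tanh(x) for x in self._matvec(self.w_init, start_state)]
        for action in plan:
            a_part = self._matvec(self.w_act, action)
            h_part = self._matvec(self.w_rec, h)
            h = [math.tanh(a + b) for a, b in zip(a_part, h_part)]
        return self._matvec(self.w_out, h)


def backtracked_predictions(model, states, actions, max_backtrack):
    """Predict the next state once per backtracking length k = 0..max_backtrack.

    states[i] and actions[i] belong to step i; the last index is the current
    step t. Length k starts from states[t - k] and uses actions[t - k:].
    """
    t = len(actions) - 1
    return [model.predict(states[t - k], actions[t - k:])
            for k in range(min(max_backtrack, t) + 1)]


def disagreement(preds):
    """Mean per-dimension standard deviation across the predictions."""
    dim = len(preds[0])
    total = 0.0
    for d in range(dim):
        vals = [p[d] for p in preds]
        mean = sum(vals) / len(vals)
        total += math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return total / dim


model = ToyAnyStepModel(state_dim=2, action_dim=1)
states = [[0.1 * i, -0.1 * i] for i in range(5)]
actions = [[0.5] for _ in range(5)]

preds = backtracked_predictions(model, states, actions, max_backtrack=3)
next_state = preds[random.randrange(len(preds))]  # random backtracking
uncertainty = disagreement(preds)                 # assumed uncertainty proxy
```

In this sketch every prediction is a single call to `predict`, so a longer plan adds recurrent steps over actions rather than repeated feeding of predicted states back into the model. Whether the paper's uncertainty quantifier matches `disagreement` exactly is not stated in this summary.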

Paper Structure (33 sections, 3 theorems, 16 equations, 6 figures, 5 tables, 3 algorithms)

Introduction
Preliminaries
Markov Decision Process and Reinforcement Learning
Model-based Reinforcement Learning
Method
Any-step Dynamics Model
ADMPO-ON: ADM for Policy Optimization in Online Setting
ADMPO-OFF: ADM for Policy Optimization in Offline Setting
Experiments
Dynamics Model Evaluation
Evaluation in Online Setting
Evaluation in Offline Setting
D4RL Benchmark Results
NeoRL Benchmark Results
Uncertainty Quantification
...and 18 more sections

Key Result

Theorem 3.4

$\beta\cdot \mathcal{U}^{\mathrm{ADM}}$ is a valid $\xi$-uncertainty quantifier, with $\beta=b\frac{\gamma r_{\mathrm{max}}}{1-\gamma}$. The quantifier bounds the error of $\hat{\mathcal{T}}^\pi$, the proxy Bellman operator induced by ADM, as an estimate of the true Bellman operator $\mathcal{T}^\pi$.
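
As a reading aid, the statement of Theorem 3.4 can be sketched in the form that a $\xi$-uncertainty quantifier usually takes in the pessimistic value iteration literature. This is an assumed form, not a quotation from the paper: the use of a value function $Q$ as the operand, the arguments $(s,a)$, and the uniform quantification over all pairs are assumptions, and the paper's exact inequality may differ.

```latex
\Pr\Big(
  \big|\hat{\mathcal{T}}^\pi Q(s,a) - \mathcal{T}^\pi Q(s,a)\big|
  \le \beta\cdot\mathcal{U}^{\mathrm{ADM}}(s,a)
  \quad \forall (s,a)
\Big) \ge 1-\xi,
\qquad
\beta = b\,\frac{\gamma r_{\mathrm{max}}}{1-\gamma}.
```

Read this way, a larger $\mathcal{U}^{\mathrm{ADM}}(s,a)$ marks state-action pairs where the ADM-induced Bellman backup is less trustworthy, which is the quantity a pessimistic offline algorithm would penalize.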

Figures (6)

Figure 1: Illustration of any-step dynamics model (left) structured using RNN and its application for next-step prediction with random backtracking (right).
Figure 2: Comparison among ADM, ensemble dynamics model, and bootstrapping RNN dynamics model, in terms of the growth curve of the compounding error as roll-out length increases, after offline learning. The overflow value is regarded as the maximum value of float32.
Figure 3: Online learning curves of ADMPO-ON (red) and five other baselines on four MuJoCo-v3 tasks. The blue dashed lines indicate the asymptotic performance of SAC for reference. The solid lines indicate the mean, and the shaded areas indicate the standard error over five different seeds.
Figure 4: Comparison between ADM and ensemble model in uncertainty quantification.
Figure 5: Comparison between ADMPO-ON and MBPO on Humanoid, in terms of (a) model mean squared error, (b) model roll-out standard deviation over diverse predictions, (c) estimated Lipschitz constant of Q, and (d) value-aware model error. Results are averaged over five seeds.
...and 1 more figure

Theorems & Definitions (9)

Definition 3.1: Any-step Dynamics Model
Definition 3.2: ADM-Uncertainty Quantifier
Theorem 3.4
proof
Definition A.1: $\xi$-Uncertainty Quantifier (from the PEVI paper)
Theorem A.2: Suboptimality of PEVI (from the PEVI paper)
proof
Theorem B.1
proof
