Why long model-based rollouts are no reason for bad Q-value estimates

Philipp Wissmann; Daniel Hein; Steffen Udluft; Volker Tresp

TL;DR

This work argues that long model-based rollouts need not cause catastrophic error growth if the policy remains informed and can react to simulated states. By comparing rollout-based Q-value estimates with model-free FQE on CartPole-v1, the authors show significantly lower RMSE and higher correlation for informed rollouts, and demonstrate that replacing bootstrapping with rollout-based targets can markedly improve robustness in offline policy learning. The results provide a practical path to leveraging long-horizon model-based planning to enhance Q-value estimation and stabilize offline RL algorithms. Altogether, the paper challenges the view that long model rollouts are inherently detrimental and highlights the benefits of policy-informed dynamics in offline settings.
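To make the contrast between bootstrapping and rollout-based targets concrete, here is a minimal sketch. It is not the authors' implementation: the function names, the toy dynamics, the constant reward of 1 per step, and the placeholder Q-function are all assumptions chosen for illustration.

```python
import numpy as np

def bootstrapped_target(reward, next_state, policy, q_fn, gamma):
    """One-step target in the style of model-free FQE: r + gamma * Q(s', pi(s'))."""
    return reward + gamma * q_fn(next_state, policy(next_state))

def rollout_target(model_step, reward_fn, policy, state, action, horizon, gamma):
    """Discounted return of a model rollout that starts with (state, action).

    The policy is queried on every simulated state, so it can react to
    whatever the model predicts (an "informed" rollout).
    """
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        next_state = model_step(state, action)
        total += discount * reward_fn(state, action, next_state)
        discount *= gamma
        state = next_state
        action = policy(state)
    return total

# Hypothetical toy problem, used only to exercise the two functions.
toy_model = lambda s, a: 0.9 * s + 0.1 * a   # assumed learned dynamics
toy_policy = lambda s: -s                    # assumed policy under evaluation
toy_reward = lambda s, a, s_next: 1.0        # constant reward of 1 per step
toy_q = lambda s, a: 2.0                     # placeholder learned Q-function

mb_target = rollout_target(toy_model, toy_reward, toy_policy,
                           state=1.0, action=0.0, horizon=3, gamma=0.5)
mf_target = bootstrapped_target(1.0, 0.9, toy_policy, toy_q, gamma=0.5)
```

The bootstrapped target inherits any error in the current Q-function estimate, whereas the rollout target depends only on the model, the reward function, and the policy.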

Abstract

This paper explores the use of model-based offline reinforcement learning with long model rollouts. While some literature criticizes this approach due to compounding errors, many practitioners have found success in real-world applications. The paper aims to demonstrate that long rollouts do not necessarily result in exponentially growing errors and can actually produce better Q-value estimates than model-free methods. These findings can potentially enhance reinforcement learning techniques.
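The claim about error growth can be illustrated with a small numerical sketch. It assumes that a "blind" rollout replays actions recorded on the real system, while an "informed" rollout lets the policy react to the model's own predicted states. The scalar linear dynamics, the model error, and the feedback gain below are hypothetical and are not taken from the paper.

```python
import numpy as np

def true_step(s, a):
    return 1.05 * s + 0.1 * a   # assumed real dynamics (unstable without control)

def model_step(s, a):
    return 1.06 * s + 0.1 * a   # assumed learned model with a small error

def policy(s):
    return -2.0 * s             # assumed stabilizing feedback policy

def rollout(step, s0, horizon, actions=None):
    """Roll out `step` from s0; replay `actions` if given, else query the policy."""
    states = [s0]
    s = s0
    for t in range(horizon):
        a = policy(s) if actions is None else actions[t]
        s = step(s, a)
        states.append(s)
    return np.array(states)

horizon = 50
real = rollout(true_step, 1.0, horizon)

# Blind: the model is driven by the actions recorded on the real system.
recorded_actions = [policy(s) for s in real[:-1]]
blind = rollout(model_step, 1.0, horizon, actions=recorded_actions)

# Informed: the policy sees the simulated states and corrects for them.
informed = rollout(model_step, 1.0, horizon)

blind_error = abs(blind[-1] - real[-1])
informed_error = abs(informed[-1] - real[-1])
print(f"blind: {blind_error:.4f}  informed: {informed_error:.6f}")
```

In this toy setting the blind rollout's error is amplified by the model's open-loop dynamics at every step, while the informed rollout stays close to the real trajectory because the policy keeps stabilizing the simulated state.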

Paper Structure (7 sections, 5 equations, 3 figures, 2 tables)

This paper contains 7 sections, 5 equations, 3 figures, 2 tables.

Introduction & related work
Experimental setup
Models
Blind vs. informed policy rollouts
Q-value estimates for policies
Policy learning
Conclusion

Figures (3)

Figure 1: Comparing the effect of different model qualities. (a) Learning curve of a pole angle model. Crosses highlight the epochs in which the weights have been saved. (b) Difference between predicting a state trajectory through blind (left) and informed (right) policy rollout.
Figure 2: Comparison of predicted state-values. (a) Scatter plot of state-values predicted model-free (FQE), with model-based rollouts (MBRO), and with fitted MBRO. (b) Histogram of the differences between the predicted state-values and the true return.
Figure 3: Iteration-wise policy performance averaged over 1,000 gym environment episodes. Blue lines represent the average discounted return over 1,000 episodes, each with 5,000 steps. Cross markers depict the fraction of episodes reaching 5,000 steps. Green markers represent iterations where perfect policies have been found, i.e., policies that balanced successfully for at least 5,000 steps in all episodes.
