Diminishing Return of Value Expansion Methods

Daniel Palenicek; Michael Lutter; João Carvalho; Daniel Dennert; Faran Ahmad; Jan Peters

TL;DR

This work asks whether the accuracy of the dynamics model limits the sample efficiency of model-based value expansion methods. Using an oracle dynamics model to eliminate compounding errors, it shows that longer rollout horizons yield quickly diminishing gains and that a perfect model improves sample efficiency only marginally over a learned model with the same horizon. Model-free Retrace performs comparably to the learned-model variants without the computational overhead of model rollouts. The pattern holds across continuous control and discrete tasks, which suggests that the bottleneck lies somewhere other than model fidelity. The results challenge the assumption that dynamics accuracy is the primary constraint in model-based reinforcement learning.

Abstract

Model-based reinforcement learning aims to increase sample efficiency, but the accuracy of dynamics models and the resulting compounding errors are often seen as key limitations. This paper empirically investigates potential sample efficiency gains from improved dynamics models in model-based value expansion methods. Our study reveals two key findings when using oracle dynamics models to eliminate compounding errors. First, longer rollout horizons enhance sample efficiency, but the improvements quickly diminish with each additional expansion step. Second, increased model accuracy only marginally improves sample efficiency compared to learned models with identical horizons. These diminishing returns in sample efficiency are particularly noteworthy when compared to model-free value expansion methods. These model-free algorithms achieve comparable performance without the computational overhead. Our results suggest that the limitation of model-based value expansion methods cannot be attributed to model accuracy. Although higher accuracy is beneficial, even perfect models do not provide unrivaled sample efficiency. Therefore, the bottleneck exists elsewhere. These results challenge the common assumption that model accuracy is the primary constraint in model-based reinforcement learning.
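
To make the role of the rollout horizon concrete, the following is a minimal sketch of an H-step value expansion target, not the authors' implementation. It assumes the plain discounted form (sum of H discounted rewards plus a discounted bootstrap value) and omits the maximum-entropy term. The names `h_step_target`, `step_fn`, `policy_fn`, and `value_fn`, as well as the toy dynamics, are hypothetical and chosen for illustration only. Passing the true environment as `step_fn` corresponds to the oracle setting, while passing a learned model corresponds to the learned setting.

```python
from typing import Callable, Tuple

State = float
Action = float


def h_step_target(
    state: State,
    action: Action,
    horizon: int,
    gamma: float,
    step_fn: Callable[[State, Action], Tuple[State, float]],
    policy_fn: Callable[[State], Action],
    value_fn: Callable[[State], float],
) -> float:
    """Return sum_{t<H} gamma^t * r_t + gamma^H * V(s_H) for horizon H >= 1.

    step_fn is either an oracle (true dynamics) or a learned model;
    the function itself does not distinguish between the two.
    """
    target = 0.0
    discount = 1.0
    s, a = state, action
    for t in range(horizon):
        if t > 0:
            a = policy_fn(s)  # actions after the first come from the policy
        s, r = step_fn(s, a)  # one model (or oracle) transition
        target += discount * r
        discount *= gamma
    # Bootstrap with the value estimate at the final rollout state.
    return target + discount * value_fn(s)


# Toy, hypothetical components: constant reward, constant value estimate.
def toy_step(s: State, a: Action) -> Tuple[State, float]:
    return s + a, 1.0


def toy_policy(s: State) -> Action:
    return 0.0


def toy_value(s: State) -> float:
    return 10.0
```

With these toy components, each additional expansion step replaces part of the bootstrap value with a model-generated reward, which is the quantity whose benefit the paper reports as quickly diminishing.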

Paper Structure (27 sections, 15 equations, 14 figures, 3 tables)

Introduction
Summary of Contributions
Outline
Preliminaries
Reinforcement Learning
Maximum-Entropy Reinforcement Learning
H-step Value Expansion
Actor Expansion
Critic Expansion
Extension to Off-Policy Trajectories
The Diminishing Return of Value Expansion
Diminishing Returns on Continuous Control Tasks
Diminishing Return of Rollout Horizons
Diminishing Return of Model Accuracy
Model-free Value Expansion
...and 12 more sections

Figures (14)

Figure 1: The diminishing return on continuous control tasks. Shows the diminishing return of critic expansion (CE) and actor expansion (AE) methods using multiple rollout horizons $H$ for a SAC agent. CE does not benefit noticeably from horizons larger than 5 in most cases. On average, CE can slightly benefit from oracle dynamics over our learned model, while it is the opposite for AE. Looking at individual runs, oracle and learned dynamics models often perform very similarly. Model-free Retrace performs comparably to the learned model and even more stably with regard to longer rollout horizons. For AE, larger horizons can be detrimental, even with an oracle model. We plot the IQM undiscounted episode return (solid line) and $90\%$ IPR (shaded area) against the number of real environment interaction steps for $9$ random seeds. Some AE runs for larger $H$ terminate early due to exploding gradients (see Section \ref{sec:gradient_analysis} for a detailed analysis).
Figure 2: Limited gains in final performance. This figure compares the maximum performance of the different rollout horizons aggregated across environments. The results are derived from the runs in Figure \ref{fig:sac_mve_ivg_retrace}. We plot the IQM normalized return and $95\%$ confidence intervals. Only CE (oracle) sees slight improvements with model rollouts, which peak at around 20%. Additional rollout steps do not contribute significantly, with the confidence intervals mostly overlapping. All other methods do not see any significant gains, and AE with long horizons again shows detrimental drops in performance.
Figure 3: Limited gains in learning speed. This figure shows how long the different value expansion methods take to reach the maximum IQM return of the SAC baseline. A value above 100% means that the method did not reach the performance of the SAC baseline within the considered timeframe. CE (oracle) benefits in terms of learning speed, but this benefit quickly diminishes with increasing rollout horizons. All other methods do not benefit considerably, with Retrace being marginally faster than CE (NN). AE methods again show detrimental drops in performance for larger horizons.
Figure 4: Diminishing return on MinAtar discrete state-action environments. We show IQM normalized returns aggregated over $4$ environments and $9$ seeds, each with $95\%$ confidence intervals. DQN does not benefit from value expansion even with oracle dynamics. Long rollout horizons can even hurt performance.
Figure 5: Analysis of $H$-step targets for multiple checkpoints along the training process. The first row shows the undiscounted episodic return of SAC along training for one seed. The checkpoints along this single training run are used for the analyses in the following rows. The second row shows the Wasserstein distance between the sample-based target distribution predicted by the $H$-step estimator and the true target distribution represented by Monte Carlo samples. The third and fourth rows display the mean and variance over particles of the sample-based target distributions. The mean targets for different horizons are roughly equal, while the variance increases with the horizon.
...and 9 more figures
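
The captions above report the IQM and the $90\%$ IPR over seeds. The sketch below shows one common way to compute these statistics, assuming IQM means the interquartile mean (the mean of the middle 50% of values) and IPR means the range between the 5th and 95th percentiles. These definitions, the function names `iqm` and `ipr90`, and the example data are assumptions for illustration and are not taken from the paper's code.

```python
import numpy as np


def iqm(returns: np.ndarray) -> np.ndarray:
    """Interquartile mean along axis 0 (the seed axis).

    Sorts the values, discards the lowest and highest 25% of entries
    (rounded down), and averages what remains.
    """
    x = np.sort(np.asarray(returns, dtype=float), axis=0)
    n = x.shape[0]
    k = int(0.25 * n)  # number of entries trimmed from each end
    return x[k:n - k].mean(axis=0)


def ipr90(returns: np.ndarray) -> np.ndarray:
    """5th and 95th percentiles along axis 0, bounding a 90% range."""
    return np.percentile(np.asarray(returns, dtype=float), [5, 95], axis=0)


# Hypothetical final returns for 9 seeds (illustrative numbers only).
example_returns = np.arange(1.0, 10.0)
example_iqm = iqm(example_returns)
example_low, example_high = ipr90(example_returns)
```

For learning curves, passing an array of shape (seeds, steps) yields one IQM value and one lower and upper bound per evaluation step.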
