Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization

Michael R. Zhang, Tom Le Paine, Ofir Nachum, Cosmin Paduraru, George Tucker, Ziyu Wang, Mohammad Norouzi

TL;DR

The paper addresses offline policy evaluation and optimization for continuous control by proposing autoregressive dynamics models that sequentially predict next-state dimensions, relaxing the standard conditional independence assumption. The autoregressive approach improves log-likelihood on held-out transitions and yields superior model-based OPE performance compared with state-of-the-art baselines on RL Unplugged datasets, while also enhancing offline policy optimization through planning (MPPI) and data augmentation. Key findings include strong correlations between low validation NLL and accurate OPE with autoregressive models, and state-of-the-art results for offline planning on challenging tasks like Cheetah Run and Fish Swim. These results indicate that richer forward models can directly improve offline RL pipelines and suggest avenues for more sophisticated autoregressive architectures and conservative evaluation techniques in the future.

Abstract

Standard dynamics models for continuous control make use of feedforward computation to predict the conditional distribution of next state and reward given current state and action using a multivariate Gaussian with a diagonal covariance structure. This modeling choice assumes that different dimensions of the next state and reward are conditionally independent given the current state and action and may be driven by the fact that fully observable physics-based simulation environments entail deterministic transition dynamics. In this paper, we challenge this conditional independence assumption and propose a family of expressive autoregressive dynamics models that generate different dimensions of the next state and reward sequentially conditioned on previous dimensions. We demonstrate that autoregressive dynamics models indeed outperform standard feedforward models in log-likelihood on held-out transitions. Furthermore, we compare different model-based and model-free off-policy evaluation (OPE) methods on RL Unplugged, a suite of offline MuJoCo datasets, and find that autoregressive dynamics models consistently outperform all baselines, achieving a new state of the art. Finally, we show that autoregressive dynamics models are useful for offline policy optimization by serving as a way to enrich the replay buffer through data augmentation and improving performance using model-based planning.
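To make the sequential generation concrete, here is a minimal sketch of autoregressive sampling and log-likelihood evaluation for a next state. It is not the authors' implementation: `toy_conditional` is a hypothetical stand-in for a learned network, with made-up coefficients and a fixed standard deviation, and reward prediction is omitted (it could be treated as one extra dimension). The only point it illustrates is that each dimension's Gaussian is conditioned on the dimensions generated before it.

```python
import math
import random


def toy_conditional(state, action, prefix, i):
    """Hypothetical stand-in for a learned network.

    Returns (mean, std) for next-state dimension i, given the current state,
    the action, and the next-state dimensions generated so far (prefix).
    """
    base = state[i] + 0.1 * sum(action)
    coupling = 0.5 * sum(prefix)  # dependence on earlier dimensions
    return base + coupling, 0.1


def sample_next_state(state, action, cond_fn, rng):
    """Sample the next state one dimension at a time."""
    next_state = []
    for i in range(len(state)):
        mean, std = cond_fn(state, action, next_state, i)
        next_state.append(rng.gauss(mean, std))
    return next_state


def log_likelihood(state, action, next_state, cond_fn):
    """Sum of per-dimension Gaussian log-densities (chain rule)."""
    total = 0.0
    for i, x in enumerate(next_state):
        mean, std = cond_fn(state, action, next_state[:i], i)
        z = (x - mean) / std
        total += -0.5 * z * z - math.log(std) - 0.5 * math.log(2 * math.pi)
    return total


rng = random.Random(0)
state, action = [0.0, 1.0, 2.0], [0.5]
next_state = sample_next_state(state, action, toy_conditional, rng)
ll = log_likelihood(state, action, next_state, toy_conditional)
print(next_state, ll)
```

A feedforward diagonal-Gaussian model corresponds to the special case where `cond_fn` ignores `prefix`, so all dimensions can be predicted in a single pass.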

Paper Structure

This paper contains 17 sections, 5 equations, 10 figures, 5 tables, 2 algorithms.

Figures (10)

  • Figure 1: Standard probabilistic dynamics models (e.g., Chua et al., 2018) use a neural network to predict the mean and standard deviation of different dimensions of the next state and reward simultaneously. By contrast, we use the same neural network architectures with several additional inputs and predict the mean and standard deviation of each dimension of the next state conditional on previous dimensions of the next state. As empirical results indicate, this small change makes a big difference in the expressive power of dynamics models. Note that reward prediction is not shown on the right to reduce clutter, but it can be thought of as the $(n+1)$th state dimension.
  • Figure 2: Network parameter count vs. validation negative log-likelihood for autoregressive and feedforward models. Autoregressive models often have a lower validation NLL irrespective of parameter count.
  • Figure 3: Validation negative log-likelihood vs. OPE correlation coefficients on different tasks. On 4 RL Unplugged tasks, we conduct an extensive experiment in which 48 autoregressive and 48 feedforward dynamics models are used for OPE. For each dynamics model, we calculate the correlation coefficient between model-based value estimates and ground truth values at a discount factor of 0.995. We find that low validation NLL numbers generally correspond to accurate policy evaluation, while higher NLL numbers are less meaningful.
  • Figure 4: Comparison of model-based OPE using autoregressive and feedforward dynamics models with state-of-the-art FQE methods based on L2 and distributional Bellman error. We plot OPE estimates on the y-axis against ground truth returns with a discount of $0.995$ on the x-axis. We report the Pearson correlation coefficient ($r$) in the title. While feedforward models fall behind FQE on most tasks, autoregressive dynamics models are often superior. See Figure \ref{fig:ope-compare4x4} for additional scatter plots on the other environments.
  • Figure 5: Model-based offline policy optimization results. With planning and data augmentation, we improve the performance over CRR exp (our baseline algorithm). When using autoregressive dynamics models (CRR-planning AR), we outperform the state of the art on Cheetah run and Fish swim. Previous SOTA results (Gulcehre et al., 2020; Wang et al., 2020) are obtained using different offline RL algorithms: Cheetah run: CRR exp; Fish swim: CRR binary max; Finger turn hard: CRR binary max; Cartpole swingup: BRAC (Wu et al., 2019).
  • ...and 5 more figures
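
The evaluation protocol described in the captions for Figures 3 and 4 (model-based value estimates at a discount of 0.995, compared with ground truth via the Pearson correlation coefficient) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's code: `toy_model_step`, the linear policies, the horizon, and the number of rollouts are all hypothetical, and no real ground-truth returns are included.

```python
import math
import random

GAMMA = 0.995  # discount factor quoted in the figure captions


def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma**t * r_t over one trajectory."""
    total, weight = 0.0, 1.0
    for r in rewards:
        total += weight * r
        weight *= gamma
    return total


def model_based_value(policy, model_step, init_state, horizon, n_rollouts, rng):
    """Average discounted return of rollouts simulated in the learned model."""
    returns = []
    for _ in range(n_rollouts):
        state, rewards = init_state, []
        for _ in range(horizon):
            action = policy(state)
            state, reward = model_step(state, action, rng)
            rewards.append(reward)
        returns.append(discounted_return(rewards))
    return sum(returns) / len(returns)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def toy_model_step(state, action, rng):
    """Hypothetical 1-D dynamics model: returns (next_state, reward)."""
    next_state = state + action + rng.gauss(0.0, 0.01)
    return next_state, -abs(next_state)


# Hypothetical policies: proportional controllers with different gains.
policies = [lambda s, k=k: -k * s for k in (0.1, 0.5, 0.9)]
rng = random.Random(0)
estimates = [
    model_based_value(p, toy_model_step, 1.0, 50, 20, rng) for p in policies
]
print(estimates)
```

Given ground-truth returns for the same policies, measured in the real environment, `pearson_r(estimates, ground_truth)` would yield the kind of correlation coefficient reported in the scatter-plot titles.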