Constrained Latent Action Policies for Model-Based Offline Reinforcement Learning

Marvin Alles, Philip Becker-Ehmck, Patrick van der Smagt, Maximilian Karl

TL;DR

Constrained Latent Action Policies (C-LAP) tackle offline model-based reinforcement learning by learning a joint autoregressive model of states and actions with a latent action space. The policy operates in latent action space and is explicitly constrained to the dataset’s action support via the latent action prior and action decoder, removing the need for external uncertainty penalties. Empirical results on D4RL and especially V-D4RL show that C-LAP is competitive with state-of-the-art methods and provides significant advantages in visually observed tasks, while ablations underscore the critical role of the latent-action constraint. The approach accelerates policy learning, reduces value overestimation, and offers a practical framework for robust offline planning in environments with rich observations.

Abstract

In offline reinforcement learning, a policy is learned using a static dataset in the absence of costly feedback from the environment. In contrast to the online setting, only using static datasets poses additional challenges, such as policies generating out-of-distribution samples. Model-based offline reinforcement learning methods try to overcome these by learning a model of the underlying dynamics of the environment and using it to guide policy search. This is beneficial, but with limited datasets, errors in the model and value overestimation on out-of-distribution states can worsen performance. Current model-based methods apply some notion of conservatism to the Bellman update, often implemented using uncertainty estimation derived from model ensembles. In this paper, we propose Constrained Latent Action Policies (C-LAP), which learns a generative model of the joint distribution of observations and actions. We cast policy learning as a constrained objective to always stay within the support of the latent action distribution, and use the generative capabilities of the model to impose an implicit constraint on the generated actions. This eliminates the need for additional uncertainty penalties on the Bellman update and significantly decreases the number of gradient steps required to learn a policy. We empirically evaluate C-LAP on the D4RL and V-D4RL benchmarks and show that C-LAP is competitive with state-of-the-art methods, especially outperforming them on datasets with visual observations.

Paper Structure

This paper contains 23 sections, 18 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of C-LAP. (a) The model is trained offline. It encodes observations $o_t$ and actions $a_t$ to latent states (gray circle) and latent actions $u_t$ (green circle), then decodes them. It also predicts rewards $\hat{r}_t$. (b) The policy is learned in the latent action space, but constrained to the support of the latent action prior, and uses the generative capabilities of the action decoder. Gradients are computed by back-propagating estimated values $\hat{V}_t$ and rewards $\hat{r}_t$ through the imagined trajectories. (c) The latent action policy is used in the real world, again constrained to the support of the latent action prior and using the generative action decoder.
  • Figure 2: Recurrent latent action state-space model. The generative process is shown by solid lines and inference by dashed lines. Stochastic variables are denoted by circles and deterministic variables by rectangles.
  • Figure 3: Policy constraint through explicit parametrization by using a linear transformation $g$ of the latent action prior $p_{\theta}(u_t \mid s_t)$ and the bounded policy $\pi_{\psi}(u_t \mid s_t) \in [-1, 1]$. The generated actions $a_t \sim p(a_t \mid s_t, g(\pi_{\psi}(u_t \mid s_t), p_{\theta}(u_t \mid s_t)))$ are implicitly constrained to the data distribution.
  • Figure 4: Evaluation on low-dimensional feature observations using D4RL benchmark datasets. We plot mean and standard deviation of normalized returns over 4 seeds.
  • Figure 5: Evaluation on visual observations using V-D4RL benchmark datasets. We plot mean and standard deviation of normalized returns over 4 seeds.
  • ...and 4 more figures
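
The policy constraint described in the Figure 3 caption can be illustrated with a minimal sketch. The caption only states that $g$ is a linear transformation of the latent action prior and a policy bounded in $[-1, 1]$; the concrete form used below, a shift by the prior mean and a scaling by the prior standard deviation, is an assumption made for illustration, as are the function name `constrain_latent_action`, the `support_scale` parameter, the Gaussian prior, and all numbers. It is not the authors' implementation.

```python
import numpy as np


def constrain_latent_action(policy_out, prior_mean, prior_std, support_scale=1.0):
    """Map a bounded policy output into a box around the prior mean.

    Assumed form of g (hypothetical, not taken from the paper):
        u = prior_mean + support_scale * prior_std * policy_out
    With policy_out in [-1, 1], u stays within support_scale standard
    deviations of the prior mean in every latent dimension.
    """
    policy_out = np.clip(policy_out, -1.0, 1.0)  # enforce the [-1, 1] bound
    return prior_mean + support_scale * prior_std * policy_out


# Hypothetical prior parameters p_theta(u_t | s_t) for a 2-D latent action.
prior_mean = np.array([0.5, -1.0])
prior_std = np.array([0.1, 2.0])

# Hypothetical unbounded network output, squashed to [-1, 1] with tanh.
raw_policy_output = np.array([3.0, -0.2])
policy_out = np.tanh(raw_policy_output)

u = constrain_latent_action(policy_out, prior_mean, prior_std)

# The extreme policy outputs land exactly on the edge of the assumed support.
u_edge = constrain_latent_action(np.array([1.0, -1.0]), prior_mean, prior_std)
```

Under this assumed parametrization, no policy output can move the latent action outside the chosen region around the prior, so the constraint holds by construction rather than through a penalty term. In the caption's pipeline, the resulting latent action would then be passed to the action decoder to produce $a_t$.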