
Let Offline RL Flow: Training Conservative Agents in the Latent Space of Normalizing Flows

Dmitriy Akimov, Vladislav Kurenkov, Alexander Nikulin, Denis Tarasov, Sergey Kolesnikov

TL;DR

Offline RL suffers from extrapolation error and distributional shift when learning from fixed datasets. The authors propose Conservative Normalizing Flows to construct a bounded latent action space via a pre-trained conditional NF and to train the policy in latent space, avoiding out-of-distribution actions by design. They show that this approach, with tanh-bounded latent outputs and a uniform base distribution, yields strong performance on D4RL locomotion and maze2d benchmarks, often outperforming state-of-the-art generative-model baselines. Ablation studies demonstrate the benefits of a uniform latent space, the necessity of constraining the latent outputs, and the superiority of NF-based encoders over VAEs. This work offers a principled, clipping-free route to conservatism in offline RL with practical gains on challenging control tasks.

Abstract

Offline reinforcement learning aims to train a policy on a pre-recorded and fixed dataset without any additional environment interactions. There are two major challenges in this setting: (1) extrapolation error caused by approximating the value of state-action pairs not well-covered by the training data and (2) distributional shift between behavior and inference policies. One way to tackle these problems is to induce conservatism, i.e., to keep the learned policies closer to the behavioral ones. To achieve this, we build upon recent works on learning policies in latent action spaces and use a special form of Normalizing Flows for constructing a generative model, which we use as a conservative action encoder. This Normalizing Flows action encoder is pre-trained in a supervised manner on the offline dataset, and then an additional policy model (a controller in the latent space) is trained via reinforcement learning. This approach avoids querying actions outside of the training dataset and therefore does not require additional regularization for out-of-dataset actions. We evaluate our method on various locomotion and navigation tasks, demonstrating that our approach outperforms recently proposed algorithms with generative action models on a large portion of datasets.

Paper Structure (15 sections, 8 equations, 9 figures, 3 tables, 2 algorithms)

Figures (9)

  • Figure 1: Schematic visualization and comparison of the PLAS and CNF (ours) approaches. Both methods use an action encoder-decoder model trained in a supervised manner on an offline dataset and a controller model that selects actions from the latent space of the encoder. PLAS uses a VAE whose normal latent distribution has unbounded support (the blue circle) and restricts latent policy outputs to only a part of the latent space (the black borders inside the latent space). Our algorithm uses a Normalizing Flow instead of a VAE and bounds the base distribution itself, allowing the latent policy to use the whole latent space.
  • Figure 2: A toy example demonstrating that NFs with either normal or uniform latent distributions can recover the training data. However, as we demonstrated above, a controller trained in the latent space of a normal-based NF is still able to sample actions outside of the training dataset.
  • Figure 3: A toy example demonstrating that a normal-based NF can be manipulated by a controller into producing points outside the training distribution. We model this by increasing the amplitude of the latent space samples, as we found this to happen in preliminary experiments when the controller tries to maximize Q-values. Note that the model with a uniform latent space does not suffer from this problem because it already uses the whole support of the latent distribution during training and sampling.
  • Figure 4: Average normalized performance on D4RL locomotion tasks. The x-axis denotes the training steps. Each curve is averaged over 3 random seeds. The shaded area represents one standard deviation.
  • Figure 5: Comparison of the proposed method with uniform (CNF) and normal (NF) latent spaces. Policy performance is significantly worse when the latent space is normal.
  • ...and 4 more figures
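
The effect described in the captions of Figures 2 and 3 can be illustrated with a one-dimensional sketch. It is not the paper's experiment: the "flows" here are closed-form stand-ins (a maximum-likelihood affine map for the normal base, the empirical quantile function for the uniform base), and the dataset, the amplitude value, and all names are assumptions chosen for illustration. Both decoders reproduce the data under their own base distribution, but only the normal-based one leaves the data range once the latent samples are amplified.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D "dataset actions" (an assumption made for this illustration).
data = rng.normal(loc=0.0, scale=0.3, size=5000)
lo, hi = data.min(), data.max()

def decode_normal(z):
    # Stand-in for a normal-based flow: the 1-D affine map x = mu + sigma * z
    # fitted to the data by maximum likelihood.
    return data.mean() + data.std() * z

def decode_uniform(u):
    # Stand-in for a uniform-based flow: latents in [-1, 1] are mapped
    # through the empirical quantile function of the data.
    return np.quantile(data, (u + 1.0) / 2.0)

def frac_outside(x):
    # Fraction of decoded points that fall outside the observed data range.
    return float(np.mean((x < lo) | (x > hi)))

# Hypothetical controller pre-activations, amplified as in the Figure 3 caption.
raw = rng.normal(size=5000)
amplitude = 4.0

# Normal base: nothing bounds the latent, so amplified latents decode
# to points beyond anything seen in the data.
ood_normal = frac_outside(decode_normal(amplitude * raw))

# Uniform base with a tanh-bounded latent: amplification saturates at the
# edge of the support, so every decoded point stays within the data range.
ood_uniform = frac_outside(decode_uniform(np.tanh(amplitude * raw)))
```

Comparing `ood_normal` with `ood_uniform` shows the qualitative gap; the exact fractions depend on the assumed data and amplitude.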