Let Offline RL Flow: Training Conservative Agents in the Latent Space of Normalizing Flows
Dmitriy Akimov, Vladislav Kurenkov, Alexander Nikulin, Denis Tarasov, Sergey Kolesnikov
TL;DR
Offline RL suffers from extrapolation error and distributional shift when learning from fixed datasets. The authors propose Conservative Normalizing Flows: a conditional normalizing flow (NF), pre-trained on the dataset, defines a bounded latent action space, and the policy is then trained in that latent space, so that out-of-distribution actions are avoided by design. With tanh-bounded latent outputs and a uniform base distribution, the approach performs strongly on the D4RL locomotion and maze2d benchmarks, often outperforming recent baselines built on generative action models. Ablation studies indicate the benefits of a uniform latent space, the necessity of constraining the latent outputs, and the advantage of NF-based encoders over VAEs. The work offers a principled, clipping-free route to conservatism in offline RL with practical gains on challenging control tasks.
Abstract
Offline reinforcement learning aims to train a policy on a pre-recorded and fixed dataset without any additional environment interactions. There are two major challenges in this setting: (1) extrapolation error caused by approximating the value of state-action pairs not well-covered by the training data and (2) distributional shift between behavior and inference policies. One way to tackle these problems is to induce conservatism, i.e., to keep the learned policies closer to the behavioral ones. To achieve this, we build upon recent works on learning policies in latent action spaces and use a special form of Normalizing Flows for constructing a generative model, which we use as a conservative action encoder. This Normalizing Flows action encoder is pre-trained in a supervised manner on the offline dataset, and then an additional policy model (a controller in the latent space) is trained via reinforcement learning. This approach avoids querying actions outside of the training dataset and therefore does not require additional regularization for out-of-dataset actions. We evaluate our method on various locomotion and navigation tasks, demonstrating that our approach outperforms recently proposed algorithms with generative action models on a large portion of datasets.
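The two-stage structure described above can be illustrated with a small sketch. This is not the authors' implementation: the class `ToyConditionalFlow`, the functions `latent_policy` and `act`, and the single affine layer with random weights are all hypothetical stand-ins, assumed here only to show how a tanh-bounded latent controller composed with an invertible, state-conditioned decoder keeps actions inside the region the flow covers.

```python
import numpy as np


class ToyConditionalFlow:
    """Hypothetical stand-in for a pre-trained conditional normalizing flow.

    It uses a single state-conditioned affine layer, a = mu(s) + sigma(s) * z,
    which is invertible in z. A real flow would stack many such layers and
    would be fitted to the offline dataset; the weights here are random.
    """

    def __init__(self, state_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_mu = rng.normal(scale=0.1, size=(state_dim, action_dim))
        self.W_sigma = rng.normal(scale=0.1, size=(state_dim, action_dim))

    def _params(self, state):
        # State-dependent shift and strictly positive scale in (0.1, 0.5).
        mu = np.tanh(state @ self.W_mu)
        sigma = 0.1 + 0.4 / (1.0 + np.exp(-(state @ self.W_sigma)))
        return mu, sigma

    def decode(self, state, z):
        """Map a latent z to an action, conditioned on the state."""
        mu, sigma = self._params(state)
        return mu + sigma * z

    def encode(self, state, action):
        """Inverse of decode: recover the latent that produces the action."""
        mu, sigma = self._params(state)
        return (action - mu) / sigma


def latent_policy(state, W_pi):
    """Hypothetical linear controller with tanh-bounded output in [-1, 1]."""
    return np.tanh(state @ W_pi)


def act(flow, W_pi, state):
    """Pick a latent with the controller, then decode it through the flow."""
    z = latent_policy(state, W_pi)
    return flow.decode(state, z), z
```

In this toy setting, a base distribution that is uniform on [-1, 1] makes the decoder's image exactly the interval [mu(s) - sigma(s), mu(s) + sigma(s)] per action dimension, so whatever the controller's weights are, the decoded action cannot leave that interval. In the setup the abstract describes, only the latent controller would be trained with reinforcement learning, with the flow already pre-trained.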
