Constrained Policy Optimization with Explicit Behavior Density for Offline Reinforcement Learning

Jing Zhang; Chi Zhang; Wenjia Wang; Bing-Yi Jing

TL;DR

The paper proposes Constrained Policy optimization with Explicit Behavior density (CPED), a method that uses a flow-GAN model to explicitly estimate the density of the behavior policy. CPED outperforms existing alternatives on various standard offline reinforcement learning tasks, yielding higher expected returns.

Abstract

Due to the inability to interact with the environment, offline reinforcement learning (RL) methods face the challenge of estimating Out-of-Distribution (OOD) points. Existing methods for addressing this issue either constrain the policy to exclude OOD actions or make the $Q$-function pessimistic. However, these methods can be overly conservative or fail to identify OOD areas accurately. To overcome this problem, we propose a Constrained Policy optimization with Explicit Behavior density (CPED) method that utilizes a flow-GAN model to explicitly estimate the density of the behavior policy. By estimating the explicit density, CPED can accurately identify the safe region and optimize within that region, resulting in less conservative learned policies. We further provide theoretical results for the flow-GAN estimator and a performance guarantee for CPED, showing that CPED can find the optimal $Q$-function value. Empirically, CPED outperforms existing alternatives on various standard offline reinforcement learning tasks, yielding higher expected returns.
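The idea of optimizing only inside a density-defined safe region can be illustrated with a toy sketch. This is not the authors' implementation: the Gaussian `behavior_log_density` is a hypothetical stand-in for the flow-GAN density estimate, `toy_q_value` is an invented $Q$-function, and the threshold and candidate grid are arbitrary choices made for illustration.

```python
import math


def behavior_log_density(action, mean=0.0, std=0.5):
    """Assumed stand-in for an explicit behavior density (a Gaussian, not a flow-GAN)."""
    z = (action - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def toy_q_value(action):
    """Hypothetical Q-function whose unconstrained maximizer (a = 2) lies far from the data."""
    return -(action - 2.0) ** 2


def constrained_greedy_action(candidates, log_density_threshold):
    """Return the highest-Q candidate whose behavior log-density clears the threshold."""
    safe = [a for a in candidates if behavior_log_density(a) >= log_density_threshold]
    if not safe:
        raise ValueError("no candidate action lies in the estimated safe region")
    return max(safe, key=toy_q_value)


# Candidate actions on a grid over [-3, 3] with step 0.1.
candidates = [i / 10.0 for i in range(-30, 31)]

# Illustrative threshold: the density level at |a| = 1.05, so the safe
# region is every grid action with |a| <= 1.0.
threshold = behavior_log_density(1.05)

best = constrained_greedy_action(candidates, threshold)
unconstrained = max(candidates, key=toy_q_value)
print(best, unconstrained)
```

In this toy setting the unconstrained maximizer is 2.0, which has very low behavior density, while the constrained choice is 1.0, the safe action closest to it. A lower threshold enlarges the safe region and makes the choice less conservative.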

Paper Structure (26 sections, 5 theorems, 56 equations, 8 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Probability Controlled Offline RL Framework
Offline RL Problem Settings
Behavior Policy Estimation by GAN with Specific Density
Constrained Policy Optimization with Explicit Behavior Density Algorithm (CPED)
Theoretical Analysis
Convergence of GAN with the Hybrid Loss
Convergence of CPED
Experiments
Performance Comparison on Standard Benchmarking Datasets for Offline RL
Ablation Study
Conclusion and Future Work
Convergence of GAN with the hybrid loss
Estimating density of behavior policy using MaxEnt IRL is equivalent to training a GAN with specific likelihood function
...and 11 more sections

Key Result

Proposition 3.1

For an offline dataset $\mathcal{D}$ generated by behavior policy $\pi_{\beta}$, the learned likelihood function $L^{\pi_{\beta}}$ obtained using a GAN with the hybrid loss in Eq. (4) is equivalent to that trained by MaxEnt IRL. If the generator of the GAN can give a specific likelihood function $p_{\theta}^G(\tau)$, then [...] where $C$ is a constant related to $\mathcal{D}$.

Figures (8)

Figure 1: (a) The ground-truth safe area in offline RL optimization; the updates of policies and $Q$-functions are done within the green area. The blue points are the collected behavior data $\mathcal{D}$, and the red point denotes the optimal policy given the states. (b) In previous approaches, the exploration of the policy takes place in a small neighborhood of points in $\mathcal{D}$ (the orange circles). (c) CPED relaxes the exploration area and constructs the feasible region (pink areas), which includes unobserved but safe points (black point).
Figure 2: (a) Average performance of BEAR and CPED on the halfcheetah-medium task, averaged over 5 seeds. BEAR reaches a bottleneck very quickly, whereas CPED keeps improving after reaching the bottleneck. (b) The time (epoch)-varying constraint parameter $\alpha$ used in the Gym-MuJoCo tasks. (c) The time (epoch)-varying constraint parameter $\alpha$ used in the AntMaze tasks.
Figure 3: Training curves for different MuJoCo tasks. All results are averaged across 5 random seeds. Each epoch contains 1000 training steps.
Figure 4: Training curves for different AntMaze tasks. All results are averaged across 5 random seeds. Each epoch contains 1000 training steps.
Figure 5: Target $Q$-function for different MuJoCo tasks. All results are averaged across 5 random seeds. Each epoch contains 1000 training steps.
...and 3 more figures

Theorems & Definitions (12)

Definition 3.1: Offline MDP
Remark 3.1
Proposition 3.1
Theorem 4.1: Informal
Remark 4.1
Remark 4.2
Theorem 4.2
Theorem 4.3
Theorem A.1
Remark A.1
...and 2 more
