A2PO: Towards Effective Offline Reinforcement Learning from an Advantage-aware Perspective

Yunpeng Qing; Shunyu Liu; Jingyuan Cong; Kaixuan Chen; Yihe Zhou; Mingli Song

TL;DR

This paper introduces a novel Advantage-Aware Policy Optimization (A2PO) method, which employs a conditional variational auto-encoder to disentangle the action distributions of intertwined behavior policies by modeling the advantage values of all training data as conditional variables.

Abstract

Offline reinforcement learning endeavors to leverage offline datasets to craft an effective agent policy without online interaction, imposing proper conservative constraints with the support of behavior policies to tackle the out-of-distribution problem. However, existing works often suffer from the constraint conflict issue when offline datasets are collected from multiple behavior policies, i.e., different behavior policies may exhibit inconsistent actions with distinct returns across the state space. To remedy this issue, recent advantage-weighted methods prioritize samples with high advantage values for agent training while inevitably ignoring the diversity of behavior policies. In this paper, we introduce a novel Advantage-Aware Policy Optimization (A2PO) method to explicitly construct advantage-aware policy constraints for offline learning under mixed-quality datasets. Specifically, A2PO employs a conditional variational auto-encoder to disentangle the action distributions of intertwined behavior policies by modeling the advantage values of all training data as conditional variables. The agent can then follow these disentangled action distribution constraints to optimize the advantage-aware policy towards high advantage values. Extensive experiments conducted on both the single-quality and mixed-quality datasets of the D4RL benchmark demonstrate that A2PO yields results superior to its counterparts. Our code is available at https://github.com/Plankson/A2PO
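The constraint conflict and the effect of advantage conditioning can be illustrated with a small sketch. It is not the authors' implementation: it replaces the conditional variational auto-encoder with plain conditional averaging, and it assumes a single-state task in which the agent picks a one-step jump $a\in[-10,10]$. The reward function, the two behavior policies, and the tanh normalization of the advantage into $[-1,1]$ are all hypothetical choices made only for illustration.

```python
import math
import random
import statistics


def reward(a):
    # Hypothetical reward with a single peak at a = 6 (not taken from the paper).
    return -abs(a - 6.0)


def make_mixed_dataset(n_per_policy=500, seed=0):
    # Two hypothetical behavior policies: one jumps left, one jumps right.
    rng = random.Random(seed)
    actions = []
    for center in (-6.0, 6.0):
        for _ in range(n_per_policy):
            a = rng.gauss(center, 1.0)
            actions.append(max(-10.0, min(10.0, a)))  # clip to the action range
    return actions


def advantage_conditions(actions):
    # Single-state task: the mean reward serves as a value baseline, so
    # reward - baseline is an advantage estimate. Squashing it with tanh
    # into [-1, 1] is an assumption of this sketch.
    rewards = [reward(a) for a in actions]
    baseline = statistics.fmean(rewards)
    scale = statistics.pstdev(rewards)
    return [math.tanh((r - baseline) / scale) for r in rewards]


def conditional_mean(actions, xis, lo, hi):
    # Stand-in for a generative model conditioned on the advantage xi:
    # average only the actions whose advantage condition lies in [lo, hi].
    selected = [a for a, xi in zip(actions, xis) if lo <= xi <= hi]
    return statistics.fmean(selected)


actions = make_mixed_dataset()
xis = advantage_conditions(actions)

# A state-only model averages over both behavior policies.
state_only = statistics.fmean(actions)
# Conditioning on the advantage separates the two behavior policies.
high_adv = conditional_mean(actions, xis, 0.0, 1.0)
low_adv = conditional_mean(actions, xis, -1.0, 0.0)

print(f"state-only mean action:     {state_only:.2f}")
print(f"high-advantage mean action: {high_adv:.2f}")
print(f"low-advantage mean action:  {low_adv:.2f}")
```

In this toy setup the state-only average falls between the two behavior modes, where neither policy acts, while the advantage-conditioned averages recover each mode separately. Selecting the high-advantage condition then points to the better of the two behaviors.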

Paper Structure (28 sections, 8 equations, 9 figures, 9 tables, 1 algorithm)

Introduction
Related Works
Preliminaries
Methodology
Behavior Policy Disentangling
Agent Policy Optimization
Experiments
Experiment Settings
Comparison on D4RL Benchmarks
Ablation Analysis
Visualization
Robustness
Time Overhead
Conclusion
Limitations
...and 13 more sections

Figures (9)

Figure 1: A didactic experiment. (a) The visualization of the toy one-step jump task and the composition of the mixed-quality dataset. The agent starts at position $0$ and can make a one-step jump $a\in[-10,10]$ to reach a new position and receive a reward $r$. (b) Learning curves of A2PO and LAPO. (c) VAE-generated action distributions of A2PO and LAPO at the initial state. LAPO VAE conditions only on the state, while A2PO VAE conditions on both the state and the advantage $\xi$.
Figure 2: An illustrative diagram of the Advantage-Aware Policy Optimization (A2PO) method.
Figure 3: Test return difference between A2PO trained with different discrete advantage conditions and the original A2PO trained with a continuous advantage condition. Task abbreviations are listed in Appendix \ref{supp::abre}. Test returns are reported in Appendix \ref{sup::abl-adv-train}.
Figure 4: Learning curves of A2PO under different fixed advantage inputs at test time, with the original continuous advantage condition used for training. Test returns are reported in Appendix \ref{sup::abl-adv-test}.
Figure 5: Visualization of A2PO latent representation after applying PCA with different advantage conditions and actual returns in the walker2d-medium-replay and hopper-medium-replay tasks. Each data point indicates a latent representation $\tilde{z}$ based on the initial state and different advantage conditions sampled uniformly from $[-1,1]$. The actual return is measured under the corresponding sampled advantage condition. The value magnitude is indicated with varying shades of color.
...and 4 more figures
