Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning

Tenglong Liu, Yang Li, Yixing Lan, Hao Gao, Wei Pan, Xin Xu

TL;DR

This work addresses the out-of-distribution (OOD) action problem in offline RL by reducing unnecessary conservatism through Adaptive Advantage-Guided Policy Regularization (A2PR). It combines a VAE-based augmented behavior policy with an advantage-guided mechanism that selectively emphasizes high-advantage actions, while adaptive constraints keep the learned policy from diverging excessively from the behavior policy. The authors provide theoretical guarantees for behavior policy improvement and a bounded performance gap, and report state-of-the-art results on the D4RL benchmark as well as strong results on additional suboptimal mixed datasets.

Abstract

In offline reinforcement learning, the challenge of out-of-distribution (OOD) actions is pronounced. To address this, existing methods often constrain the learned policy through policy regularization. However, these methods often suffer from unnecessary conservativeness, which hampers policy improvement. This occurs because they indiscriminately use all actions from the behavior policy that generated the offline dataset as constraints. The problem becomes particularly noticeable when the quality of the dataset is suboptimal. Thus, we propose Adaptive Advantage-guided Policy Regularization (A2PR), which obtains high-advantage actions from an augmented behavior policy combined with a VAE to guide the learned policy. A2PR can select high-advantage actions that differ from those present in the dataset, while still effectively maintaining conservatism with respect to OOD actions. This is achieved by harnessing the VAE's capacity to generate samples matching the distribution of the data points. We theoretically prove that the improvement of the behavior policy is guaranteed. In addition, A2PR effectively mitigates value overestimation with a bounded performance gap. Empirically, we conduct a series of experiments on the D4RL benchmark, where A2PR demonstrates state-of-the-art performance. Furthermore, experimental results on additional suboptimal mixed datasets reveal that A2PR exhibits superior performance. Code is available at https://github.com/ltlhuuu/A2PR.
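The advantage-guided selection described in the abstract can be illustrated with a minimal sketch. It assumes that, for each state, one action comes from the dataset and one from a generative model such as a VAE, and that the candidate with the higher advantage A(s, a) = Q(s, a) - V(s) becomes the regularization target. The function `pick_guidance_action`, the toy critics `toy_q` and `toy_v`, and the tie-breaking rule are hypothetical names and choices made for illustration; they are not taken from the authors' code, and the paper's exact selection rule may differ.

```python
from typing import Callable, Tuple

Action = Tuple[float, ...]
State = Tuple[float, ...]


def pick_guidance_action(
    state: State,
    dataset_action: Action,
    generated_action: Action,
    q_fn: Callable[[State, Action], float],
    v_fn: Callable[[State], float],
) -> Tuple[Action, float]:
    """Return the candidate with the higher advantage A(s, a) = Q(s, a) - V(s)."""
    baseline = v_fn(state)
    adv_data = q_fn(state, dataset_action) - baseline
    adv_gen = q_fn(state, generated_action) - baseline
    # Assumption: the generated action is used only when strictly better;
    # ties fall back to the dataset action.
    if adv_gen > adv_data:
        return generated_action, adv_gen
    return dataset_action, adv_data


# Toy critics (hypothetical): Q peaks at a = 0.5, V is a constant baseline.
def toy_q(state: State, action: Action) -> float:
    return -((action[0] - 0.5) ** 2)


def toy_v(state: State) -> float:
    return -0.1


# The dataset action 0.0 is far from the peak; the generated action 0.4 is close.
target, advantage = pick_guidance_action((0.0,), (0.0,), (0.4,), toy_q, toy_v)
```

In a training loop, the selected action would presumably serve as the target of the policy-regularization term, so the learned policy is pulled toward better in-distribution actions rather than toward every dataset action. That coupling is not shown here.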

Paper Structure

This paper contains 39 sections, 8 theorems, 31 equations, 7 figures, 5 tables, 1 algorithm.

Key Result

Proposition 4.1

Suppose that $A^{\pi_\beta}(s,a)(\hat{\pi}_\beta(a|s)-\pi_\beta(a|s)) \geq 0$. Then, we have

Figures (7)

  • Figure 1: All trajectories and the trajectories from the final 100,000 steps of the trained policy for both A2PR and TD3+BC.
  • Figure 2: The performance profiles of reliable evaluation on D4RL based on 18 tasks and 5 random seeds for each task and the comparison between estimated Q-value and true Q-value of different methods.
  • Figure 3: The performance of different methods in the mixed policy datasets and the comprehensive ablation study of A2PR on halfcheetah-medium-v2 with different components.
  • Figure 4: Results of performance comparisons conducted on nine original tasks in the D4RL dataset. The lines and shaded areas indicate the averages and standard deviations calculated over 5 random seeds, respectively.
  • Figure 5: Reliable evaluation for statistical uncertainty on D4RL with 95% CIs based on 18 tasks and 5 random seeds for each task.
  • ...and 2 more figures

Theorems & Definitions (14)

  • Proposition 4.1
  • Proposition 4.3: Behavior Policy Improvement Guarantee
  • Theorem 4.4
  • Theorem 4.5: Performance Gap of A2PR
  • Lemma 1.1
  • proof
  • Proposition 1.2
  • proof
  • proof
  • Lemma 1.3: Triangle inequality
  • ...and 4 more