Pessimistic Auxiliary Policy for Offline Reinforcement Learning

Fan Zhang; Baoru Huang; Xin Zhang

TL;DR

The paper constructs a pessimistic auxiliary policy that samples reliable actions by maximizing the lower confidence bound of the Q-function. Using this auxiliary policy can effectively improve the performance of other offline RL approaches.

Abstract

Offline reinforcement learning aims to learn an agent from pre-collected datasets, avoiding unsafe and inefficient real-time interaction. However, inevitable access to out-of-distribution actions during the learning process introduces approximation errors, which accumulate and cause considerable overestimation. In this paper, we construct a new pessimistic auxiliary policy for sampling reliable actions. Specifically, we develop a pessimistic auxiliary strategy by maximizing the lower confidence bound of the Q-function. The pessimistic auxiliary strategy exhibits relatively high value and low uncertainty in the vicinity of the learned policy, which prevents the learned policy from sampling high-value actions with potentially high errors during learning. Because actions sampled from the pessimistic auxiliary strategy introduce less approximation error, error accumulation is alleviated. Extensive experiments on offline reinforcement learning benchmarks show that utilizing the pessimistic auxiliary strategy can effectively improve the efficacy of other offline RL approaches.
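To make the idea of "maximizing the lower confidence bound of the Q-function in the vicinity of the learned policy" concrete, the following is a minimal sketch and not the paper's implementation. It assumes, purely for illustration, that uncertainty is measured as the standard deviation across a small ensemble of toy Q-functions, that the bound is `mean - beta * std`, and that "vicinity" means a ball of radius `radius` around the learned policy's action. The names `lcb`, `ensemble_q`, `pessimistic_action`, `beta`, `radius`, and the toy `centers` are all hypothetical.

```python
import numpy as np

def lcb(q_values, beta=1.0):
    """Lower confidence bound: ensemble mean minus beta times ensemble std (assumed form)."""
    return float(np.mean(q_values) - beta * np.std(q_values))

def ensemble_q(action, centers):
    """Toy ensemble (assumption): member k scores an action as -||a - c_k||^2."""
    return -np.sum((centers - action) ** 2, axis=1)

def lcb_at(action, centers, beta):
    return lcb(ensemble_q(action, centers), beta)

def pessimistic_action(policy_action, centers, beta=1.0, radius=0.5,
                       step=0.1, n_steps=30, eps=1e-4):
    """Ascend the LCB from the policy's action, staying within `radius` of it."""
    a0 = np.asarray(policy_action, dtype=float)
    a = a0.copy()
    for _ in range(n_steps):
        # Central finite-difference gradient of the LCB with respect to the action.
        grad = np.zeros_like(a)
        for i in range(a.size):
            d = np.zeros_like(a)
            d[i] = eps
            grad[i] = (lcb_at(a + d, centers, beta) - lcb_at(a - d, centers, beta)) / (2 * eps)
        candidate = a + step * grad
        # Project back onto the ball around the policy action.
        offset = candidate - a0
        norm = np.linalg.norm(offset)
        if norm > radius:
            candidate = a0 + offset * (radius / norm)
        # Accept only non-decreasing steps; otherwise halve the step size.
        if lcb_at(candidate, centers, beta) >= lcb_at(a, centers, beta):
            a = candidate
        else:
            step *= 0.5
    return a

# Hypothetical ensemble parameters and policy action, chosen only for the demo.
centers = np.array([[0.6, -0.1], [0.4, -0.3], [0.5, 0.2], [0.7, -0.4], [0.3, 0.0]])
policy_action = np.array([0.0, 0.0])
a_p = pessimistic_action(policy_action, centers)
print("policy action LCB:     ", lcb_at(policy_action, centers, 1.0))
print("pessimistic action LCB:", lcb_at(a_p, centers, 1.0))
```

In this sketch the returned action never has a lower bound value than the policy's own action, because steps that decrease the bound are rejected. How the paper actually estimates uncertainty and defines the auxiliary policy is given by its Proposition 1, which this example does not reproduce.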

Paper Structure (13 sections, 3 theorems, 23 equations, 4 figures, 3 tables, 1 algorithm)

Introduction
Introduction
Related Work
Preliminaries
Analysis of Overestimation
Pessimistic Auxiliary Policy
Offline RL with Pessimistic Auxiliary Policy
Convergence Analysis
Experiments
Experiment Settings
Overall Performance (RQ1).
Performance Analysis (RQ2).
Conclusion

Key Result

Proposition 1

Pessimistic auxiliary policy can be defined as $\pi_p=\delta(\mu_p)$, where

Figures (4)

Figure 1: Simulation study shows the causes of estimation error of linear regression.
Figure 2: Depiction of lower confidence bound of Q function and pessimistic auxiliary policy.
Figure 3: The distances between the action executed by policy and the action in the HalfCheetah dataset.
Figure 4: The distances between the action executed by policy and the action in the AntMaze dataset.

Theorems & Definitions (6)

Proposition 1
proof
Proposition 2
proof
Proposition 3
proof
