Exclusively Penalized Q-learning for Offline Reinforcement Learning

Junghyuk Yeom, Yonghyeon Jo, Jungmo Kim, Sanghyeon Lee, Seungyul Han

TL;DR

This paper proposes Exclusively Penalized Q-learning (EPQ), which reduces estimation bias in the value function by selectively penalizing states that are prone to inducing estimation errors, and improves performance in various offline control tasks compared to other offline RL methods.

Abstract

Constraint-based offline reinforcement learning (RL) imposes policy constraints or penalties on the value function to mitigate overestimation errors caused by distributional shift. This paper focuses on a limitation of existing offline RL methods with penalized value functions: unnecessary bias introduced into the value function can lead to underestimation. To address this concern, we propose Exclusively Penalized Q-learning (EPQ), which reduces estimation bias in the value function by selectively penalizing states that are prone to inducing estimation errors. Numerical results show that our method significantly reduces underestimation bias and improves performance in various offline control tasks compared to other offline RL methods.

Paper Structure

This paper contains 33 sections, 1 theorem, 19 equations, 8 figures, 6 tables, 1 algorithm.

Key Result

Theorem 3.1

We denote by $\hat{Q}^\pi$ the $Q$-function to which the $Q$-update of EPQ converges when using the proposed penalty $\mathcal{P}_\tau$ in eq:bellmanours. Then, the expected value of $\hat{Q}^\pi$ underestimates the expected true policy value, i.e., $\mathbb{E}_{a\sim\pi}[\hat{Q}^\pi(s,a)] \leq \mathbb{E}_{a\sim\pi}[Q^\pi(s,a)]$ ...

Figures (8)

  • Figure 1: Histograms of $\pi$ and $\hat{\beta}$ (left axis), and the estimation bias of CQL with various $\alpha$ (right axis) at $s_0$ for three cases: (a) $\beta= \textrm{Unif}(-2,2)$ and $\pi= N(0,0.2)$ (b) $\beta= \frac{1}{2} N(-1,0.3) + \frac{1}{2} N(1,0.3)$ and $\pi= N(1,0.2)$ (c) $\beta= \frac{1}{2} N(-1,0.3) + \frac{1}{2} N(1,0.3)$ and $\pi= N(0,0.2)$, where $\textrm{Unif}(-2,2)$ represents a uniform distribution and $N(\mu,\sigma)$ denotes a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$.
  • Figure 2: An illustration of our exclusive penalty: (a) The log-probability of $\hat{\beta}$ and the thresholds $\tau_1$ and $\tau_2$ according to the number of data samples $N_1$ and $N_2$, where $N_1 \ll N_2$. (b) The penalty adaptation factor $f^{\pi,\hat{\beta}}_\tau$, which represents the amount of adaptive penalty, indicating how much $\log\hat{\beta}$ exceeds the threshold $\tau$. Three different policies $\pi_i,~i=1,2,3$, are considered.
  • Figure 3: Histogram of $\hat{\beta}$ (left axis), and the corresponding $f_\tau^{\pi,\hat{\beta}}(s)$ with various $\tau$ (right axis) for two cases: (a) $\beta= \textrm{Unif}(-2,2)$ (b) $\beta= \frac{1}{2} N(-1,0.3) + \frac{1}{2} N(1,0.3)$.
  • Figure 4: Histograms of $\pi$ and $\hat{\beta}$ (left axis), and the estimation bias of CQL and EPQ with various $\tau$ (right axis) for three cases: (a) $\beta= \textrm{Unif}(-2,2)$ and $\pi= N(0,0.2)$ (b) $\beta= \frac{1}{2} N(-1,0.3) + \frac{1}{2} N(1,0.3)$ and $\pi= N(1,0.2)$ (c) $\beta= \frac{1}{2} N(-1,0.3) + \frac{1}{2} N(1,0.3)$ and $\pi= N(0,0.2)$.
  • Figure 5: An illustration of the prioritized dataset (PD). As the policy focuses on actions with maximum $Q$-values, the difference between $\hat{\beta}$ and $\pi$ becomes substantial, inducing a large penalty: (a) The change of data distribution from $\hat{\beta}$ (w/o PD) to $\hat{\beta}^Q$ (with PD) (b) The corresponding penalty graphs for $\hat{\beta}$ (w/o PD) and $\hat{\beta}^Q$ (with PD).
  • ...and 3 more figures
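
The toy setting in the captions above (a one-dimensional action space, a behavior policy $\beta$ that is either $\textrm{Unif}(-2,2)$ or a two-mode Gaussian mixture, and a Gaussian policy $\pi$) can be sketched in a few lines of NumPy. The sketch below is illustrative only. The histogram estimator for $\hat{\beta}$, the bin count, the function names, and in particular the functional form used for the adaptation factor, $f = \mathbb{E}_{a\sim\pi}[\min(1, \exp(-(\log\hat{\beta}(a) - \tau)))]$, are assumptions chosen to match the caption's description ("how much $\log\hat{\beta}$ exceeds the threshold $\tau$"); they are not taken from this summary and may differ from the paper's definition.

```python
import numpy as np


def sample_behavior(kind, n, rng):
    """Draw n actions from one of the toy behavior policies in the captions."""
    if kind == "uniform":   # beta = Unif(-2, 2)
        return rng.uniform(-2.0, 2.0, size=n)
    if kind == "mixture":   # beta = 0.5 N(-1, 0.3) + 0.5 N(1, 0.3)
        centers = rng.choice([-1.0, 1.0], size=n)
        return rng.normal(centers, 0.3)
    raise ValueError(f"unknown behavior policy: {kind}")


def histogram_density(samples, bins=40, lo=-3.0, hi=3.0):
    """Return a -> beta_hat(a), a piecewise-constant density estimate.

    The histogram estimator and its bin count are illustrative choices.
    """
    density, edges = np.histogram(samples, bins=bins, range=(lo, hi), density=True)

    def beta_hat(a):
        a = np.asarray(a, dtype=float)
        idx = np.clip(np.searchsorted(edges, a, side="right") - 1, 0, bins - 1)
        inside = (a >= lo) & (a <= hi)
        return np.where(inside, density[idx], 0.0)

    return beta_hat


def adaptation_factor(beta_hat, pi_mean, pi_std, tau, rng, n=20000, eps=1e-8):
    """Monte Carlo estimate of an ASSUMED adaptation factor.

    f = E_{a~pi}[min(1, exp(-(log beta_hat(a) - tau)))]
    The factor stays at 1 where log beta_hat <= tau (little data) and
    shrinks toward 0 where log beta_hat exceeds tau (plenty of data).
    """
    a = rng.normal(pi_mean, pi_std, size=n)
    log_b = np.log(beta_hat(a) + eps)
    return float(np.mean(np.minimum(1.0, np.exp(-(log_b - tau)))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    beta_hat = histogram_density(sample_behavior("mixture", 10000, rng))
    for mean in (0.0, 1.0):  # pi = N(0, 0.2) and pi = N(1, 0.2)
        f = adaptation_factor(beta_hat, mean, 0.2, tau=-2.0, rng=rng)
        print(f"pi = N({mean}, 0.2): f = {f:.3f}")
```

Under this assumed form, a policy centered on a well-covered mode of the mixture receives a smaller factor than a policy centered between the modes, which matches the qualitative behavior the captions describe for the exclusive penalty.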

Theorems & Definitions (1)

  • Theorem 3.1