Grid-Mapping Pseudo-Count Constraint for Offline Reinforcement Learning

Yi Shen, Hanyan Huang

TL;DR

The grid-mapping pseudo-count method (GPC), a novel pseudo-count method for continuous environments, extends the count-based method from discrete to continuous environments and is combined with the soft actor-critic algorithm (SAC) to create a new algorithm called GPC-SAC.

Abstract

Offline reinforcement learning learns from a static dataset without interacting with the environment, which improves safety and gives it strong application prospects. However, directly applying a naive reinforcement learning algorithm usually fails in the offline setting because out-of-distribution (OOD) state-action pairs lead to inaccurate Q-value approximation. Penalizing the Q-values of OOD state-action pairs is an effective way to address this problem. Among such penalty methods, count-based methods have achieved good results in discrete domains while remaining simple. Inspired by this, the Grid-Mapping Pseudo-Count method (GPC), a novel pseudo-count method for continuous domains, is proposed by extending the count-based method from discrete to continuous domains. First, the continuous state and action spaces are mapped to a discrete space using grid mapping, and the Q-values of OOD state-action pairs are then constrained through pseudo-counts. Second, a theoretical proof shows that GPC obtains appropriate uncertainty constraints under fewer assumptions than other pseudo-count methods. Third, GPC is combined with the Soft Actor-Critic algorithm (SAC) to obtain a new algorithm called GPC-SAC. Finally, experiments on D4RL datasets show that GPC-SAC achieves better performance at lower computational cost than other algorithms that constrain the Q-value.
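
The grid-mapping step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the uniform grid, the per-dimension bounds `low` and `high`, the bin count `bins`, and the helper names `grid_index` and `build_counts` are all assumptions made for illustration. Each continuous state-action vector is mapped to a tuple of integer cell indices, and the pseudo-count of a pair is the number of dataset samples that fall in the same cell.

```python
from collections import Counter
import math

def grid_index(x, low, high, bins):
    """Map a continuous vector to a tuple of integer grid-cell indices.

    Assumes a uniform grid with `bins` cells per dimension between the
    hypothetical bounds `low` and `high` (not taken from the paper).
    """
    cell = []
    for xi, lo, hi in zip(x, low, high):
        scaled = (xi - lo) / (hi - lo)           # normalise to [0, 1]
        i = math.floor(scaled * bins)            # index of the cell along this axis
        cell.append(min(max(i, 0), bins - 1))    # clip values on or beyond the bounds
    return tuple(cell)

def build_counts(states, actions, low, high, bins):
    """Count how many dataset state-action pairs fall in each grid cell."""
    return Counter(
        grid_index(list(s) + list(a), low, high, bins)
        for s, a in zip(states, actions)
    )

# Toy dataset: 1-D states in [0, 1] and 1-D actions in [-1, 1].
states = [[0.10], [0.12], [0.90]]
actions = [[-0.9], [-0.8], [0.5]]
low, high, bins = [0.0, -1.0], [1.0, 1.0], 4

counts = build_counts(states, actions, low, high, bins)
seen = counts[grid_index([0.11, -0.85], low, high, bins)]   # cell shared with two samples
unseen = counts[grid_index([0.5, 0.0], low, high, bins)]    # cell with no samples (OOD)
```

In this toy example the first two samples share a cell, so a nearby query receives a pseudo-count of 2, while a query in an empty cell receives 0 and would be treated as out-of-distribution.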

Paper Structure

This paper contains 31 sections, 4 theorems, 73 equations, 9 figures, 5 tables, and 1 algorithm.

Key Result

Lemma 1

With an appropriately chosen hyperparameter $\alpha$, $u(s,a) = \alpha \sqrt{\frac{\ln T}{n(s,a)}}$ is a suitable uncertainty constraint in discrete offline RL.
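
As a rough numerical illustration of the quantity in Lemma 1, the sketch below evaluates $u(s,a)$ for a few counts. Several things here are assumptions rather than details from the paper: $T$ is treated as the total number of samples, `uncertainty` is a hypothetical helper name, and the small floor `n_min` is added only so that a zero count does not cause a division by zero.

```python
import math

def uncertainty(n_sa, total, alpha=1.0, n_min=1e-3):
    """Evaluate u(s, a) = alpha * sqrt(ln(T) / n(s, a)).

    `total` plays the role of T (assumed here to be the sample count),
    and `n_min` is an illustrative floor for unseen pairs, not part of the lemma.
    """
    n = max(n_sa, n_min)
    return alpha * math.sqrt(math.log(total) / n)

T = 100
u_frequent = uncertainty(50, T)   # well-covered pair: small penalty
u_rare = uncertainty(1, T)        # rarely seen pair: larger penalty
u_unseen = uncertainty(0, T)      # unseen pair: largest (but finite) penalty
```

The penalty shrinks as the count grows, so well-covered state-action pairs are barely constrained while rare or unseen pairs are penalized most strongly.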

Figures (9)

  • Figure 1: Obtaining pseudo-counts through GPC
  • Figure 1: The ablation on the learning rate
  • Figure 2: Gym training curve
  • Figure 2: The ablation on the Q-function update
  • Figure 3: Gym training curve
  • ...and 4 more figures

Theorems & Definitions (11)

  • Lemma 1
  • Definition 1
  • Definition 2
  • Lemma 2
  • Corollary 1
  • Theorem 1
  • proof
  • proof
  • proof
  • proof
  • ...and 1 more