Offline Reinforcement Learning with Imbalanced Datasets

Li Jiang, Sijie Cheng, Jielin Qiu, Haoran Xu, Wai Kin Chan, Zhao Ding

TL;DR

This paper proposes a novel offline RL method that augments CQL with a retrieval process to recall past related experiences, effectively alleviating the challenges posed by imbalanced datasets.

Abstract

The prevalent use of benchmarks in current offline reinforcement learning (RL) research has led model development to neglect the imbalance of real-world dataset distributions. Real-world offline RL datasets are often imbalanced over the state space because of exploration challenges or safety considerations. In this paper, we specify properties of imbalanced datasets in offline RL, where the state coverage follows a power-law distribution characterized by skewed policies. Theoretically and empirically, we show that typical offline RL methods based on distributional constraints, such as conservative Q-learning (CQL), are ineffective at extracting policies from imbalanced datasets. Inspired by natural intelligence, we propose a novel offline RL method that augments CQL with a retrieval process to recall past related experiences, effectively alleviating the challenges posed by imbalanced datasets. We evaluate our method on several tasks with varying levels of dataset imbalance, using a variant of D4RL. Empirical results demonstrate the superiority of our method over other baselines.
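
To make the notion of power-law state coverage concrete, the following is a minimal sketch, not the authors' dataset construction. It assumes a finite set of states ranked from most to least visited, with visitation probability proportional to `rank ** -exponent`. The function names and the `exponent` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def power_law_coverage(num_states, exponent):
    """Return visitation probabilities p(rank) proportional to rank ** -exponent.

    Assumption for illustration: states are indexed by visitation rank,
    so state 0 is the most covered and the last state the least covered.
    """
    ranks = np.arange(1, num_states + 1, dtype=float)
    weights = ranks ** (-exponent)
    return weights / weights.sum()


def sample_imbalanced_states(num_states, exponent, num_samples, seed=0):
    """Draw state indices under power-law coverage and count visits per state."""
    rng = np.random.default_rng(seed)
    probs = power_law_coverage(num_states, exponent)
    states = rng.choice(num_states, size=num_samples, p=probs)
    return np.bincount(states, minlength=num_states)


if __name__ == "__main__":
    # A larger exponent concentrates more of the data on a few head states.
    for exponent in (0.5, 1.0, 2.0):
        counts = sample_imbalanced_states(100, exponent, 10_000)
        head_share = counts[:10].sum() / counts.sum()
        print(f"exponent={exponent}: share of samples in top 10 states = {head_share:.2f}")
```

Under this assumption, raising the exponent leaves the tail states with few or no samples, which is the regime the abstract describes as imbalanced.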

Paper Structure (13 sections, 1 theorem, 10 equations, 7 figures, 5 tables, 1 algorithm)

Key Result

Theorem 3.2

With probability at least $1-\delta$, for any prescribed level of safety $\zeta$, the maximum possible policy improvement over choices of $\alpha$ satisfies $J\left(\pi\right)-J\left(\beta\right) \leq \zeta^{+}$. The bound $\zeta^{+}$ is expressed in terms of $h^*$, a monotonically decreasing function of $\alpha$, and $h(0)=\mathcal{O}(1)$.

Figures (7)

  • Figure 1: Left: The state coverage and task description of the imbalanced dataset in Four-room, where the state coverage is heavy-tailed and the states inside the red rectangle have no coverage. The agent must find a path from the start state (yellow grid) in the first room to the goal state (red grid) in the last room. Center: Policy performance with lower pessimism. The learned policy succeeds in the first room but fails in the last room. Right: Policy performance with higher pessimism. The learned policy succeeds in the last room but fails in the first room.
  • Figure 2: Left: The performance of CQL on the Antmaze-medium task with imbalanced datasets. As the imbalance of the dataset increases from Easy to Hard+ (increasing $\alpha$), the performance keeps dropping, and CQL fails outright on the Hard+ dataset. Center: TD errors over 500 states from sufficient coverage $d_\beta^+(s)$ and insufficient coverage $d_\beta^-(s)$, respectively, at medium-level imbalance. TD errors from sufficient coverage $d_\beta^+(s)$ are smaller than those from insufficient coverage $d_\beta^-(s)$ due to inefficient training. Right: The performance of CQL and CQL with PER on the Antmaze-medium task at easy-level imbalance. PER worsens the final performance because it introduces an additional distributional shift problem.
  • Figure 3: Average normalized scores of RB-CQL against other baselines over the whole training process. At the final checkpoint, RB-CQL reaches the best performance in 10 out of 12 tasks, with a margin that widens as the imbalance increases in AntMaze.
  • Figure 4: Visualisation of N=5 retrieved from RB-CQL for Mujoco locomotion tasks (first row for Hopper and second row for Walker2d.)
  • Figure 5: The performance of CQL and CQL with PER on the Antmaze-medium task across easy to hard+ imbalance levels. Except on the hard+ task, augmenting CQL with PER worsens the final performance compared with CQL alone.
  • ...and 2 more figures
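
Figure 4 refers to N=5 items retrieved by RB-CQL, but this summary does not describe how the retrieval is performed. As a rough illustration only, the sketch below assumes a plain Euclidean nearest-neighbour lookup over stored state vectors. The function `retrieve_nearest` and its arguments are hypothetical and should not be read as the authors' retrieval mechanism.

```python
import numpy as np


def retrieve_nearest(query, dataset_states, n=5):
    """Return indices of the n stored states closest to `query`.

    Assumption for illustration: similarity is plain Euclidean distance
    in state space. Indices are ordered from nearest to farthest.
    """
    dataset_states = np.asarray(dataset_states, dtype=float)
    query = np.asarray(query, dtype=float)
    # Distance from the query to every stored state.
    dists = np.linalg.norm(dataset_states - query, axis=1)
    # A stable sort keeps the original order among equidistant states.
    order = np.argsort(dists, kind="stable")
    return order[:n]


if __name__ == "__main__":
    states = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.1, 0.0], [10.0, 10.0]])
    print(retrieve_nearest(np.array([0.0, 0.0]), states, n=3))
```

A retrieval-augmented agent could then condition its value or policy update on the transitions stored at the returned indices. How RB-CQL actually uses the retrieved items is not stated in this summary.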

Theorems & Definitions (2)

  • Definition 3.1: Differential concentrability.
  • Theorem 3.2: Limited policy improvement via distributional constraints.