Learning from Sparse Offline Datasets via Conservative Density Estimation

Zhepeng Cen, Zuxin Liu, Zitong Wang, Yihang Yao, Henry Lam, Ding Zhao

TL;DR

Conservative Density Estimation (CDE) is a training algorithm for offline reinforcement learning that handles out-of-distribution extrapolation errors by explicitly imposing constraints on the state-action occupancy stationary distribution, which addresses the support mismatch issue in marginal importance sampling.

Abstract

Offline reinforcement learning (RL) offers a promising direction for learning policies from pre-collected datasets without requiring further interactions with the environment. However, existing methods struggle to handle out-of-distribution (OOD) extrapolation errors, especially in sparse reward or scarce data settings. In this paper, we propose a novel training algorithm called Conservative Density Estimation (CDE), which addresses this challenge by explicitly imposing constraints on the state-action occupancy stationary distribution. CDE overcomes the limitations of existing approaches, such as the stationary distribution correction method, by addressing the support mismatch issue in marginal importance sampling. Our method achieves state-of-the-art performance on the D4RL benchmark. Notably, CDE consistently outperforms baselines in challenging tasks with sparse rewards or insufficient data, demonstrating the advantages of our approach in addressing the extrapolation error problem in offline RL.
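
The support mismatch issue mentioned above can be illustrated with a toy sketch. This is not the paper's algorithm; it only shows why a plain marginal importance sampling estimate, with weights $w = d^\pi / d^D$, loses whatever mass the policy's occupancy places on state-action pairs the dataset never visits. All names and numbers here (`mis_estimate`, `d_data`, `d_pi`, `reward`) are hypothetical and chosen for illustration.

```python
def mis_estimate(d_data, d_pi, reward):
    """Marginal importance sampling estimate of E_{d_pi}[r].

    Uses weights w = d_pi / d_data. Only pairs with d_data > 0 can be
    reweighted; mass that d_pi places elsewhere is silently dropped.
    """
    total = 0.0
    for p_data, p_pi, r in zip(d_data, d_pi, reward):
        if p_data > 0.0:
            w = p_pi / p_data          # importance ratio on the data support
            total += p_data * w * r    # expectation under the data distribution
    return total


def true_value(d_pi, reward):
    """Exact expectation of the reward under the policy's occupancy."""
    return sum(p * r for p, r in zip(d_pi, reward))


# Hypothetical toy problem with three state-action pairs.
# The third pair never appears in the dataset (d_data = 0).
d_data = [0.5, 0.5, 0.0]
d_pi = [0.25, 0.25, 0.5]
reward = [1.0, 1.0, 4.0]

estimate = mis_estimate(d_data, d_pi, reward)
exact = true_value(d_pi, reward)
print(estimate, exact)
```

In this toy case the estimate only accounts for the first two pairs, so it differs from the exact value whenever the policy's occupancy leaves the data support. Constraining the occupancy directly, as the abstract describes, is one way to keep that out-of-support mass under control.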

Paper Structure (41 sections, 8 theorems, 56 equations, 6 figures, 14 tables, 2 algorithms)

Key Result

Proposition 1

Under the paper's assumption on $f$ (label ass:f), the inner maximization problem $\max_{w\geq 0} {\mathcal{L}}'(w,v,\lambda)$ admits a closed-form solution expressed in terms of $\tilde{A}(s,a) := A(s,a) - \bm{1}\{(s,a) \in \text{supp}(\mu)\} \cdot\lambda(s,a)$, the regularized advantage function, where $\bm{1}\{\cdot\}$ is the indicator function.
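
The definition of the regularized advantage can be sketched in a few lines. This is a minimal illustration of the formula $\tilde{A}(s,a) = A(s,a) - \bm{1}\{(s,a) \in \text{supp}(\mu)\} \cdot \lambda(s,a)$ only, not the paper's implementation. The function name `regularized_advantage`, the flat-list representation of state-action pairs, and the example values are all assumptions made for the example.

```python
def regularized_advantage(advantage, lam, in_support_mu):
    """Elementwise A~(s,a) = A(s,a) - 1{(s,a) in supp(mu)} * lambda(s,a).

    advantage:      list of A(s,a) values, one per state-action pair
    lam:            list of multipliers lambda(s,a)
    in_support_mu:  list of booleans, True where (s,a) lies in supp(mu)
    """
    if not (len(advantage) == len(lam) == len(in_support_mu)):
        raise ValueError("all inputs must have the same length")
    # The indicator switches the lambda penalty on only inside supp(mu).
    return [a - (l if m else 0.0)
            for a, l, m in zip(advantage, lam, in_support_mu)]


# Hypothetical values for three state-action pairs.
A = [1.0, -0.5, 2.0]
lam = [0.25, 0.25, 0.5]
mask = [True, False, True]

A_tilde = regularized_advantage(A, lam, mask)
print(A_tilde)
```

Pairs outside $\text{supp}(\mu)$ keep their original advantage, while pairs inside it are shifted down by the corresponding $\lambda(s,a)$.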

Figures (6)

  • Figure 1: The results on sub-datasets with different dataset sizes.
  • Figure 2: The heatmaps of agents with different levels of conservatism in maze2d-large environment. Yellow denotes the high occupation probability. The starting point of each trajectory may vary but the destination (red star) is the same. Smaller $\tilde{\epsilon}$ indicates more conservative policy. The yellow accumulation points except the destination indicate that the agent is stuck at those regions.
  • Figure 3: The performances with different $\zeta$.
  • Figure 4: The training curves of CDE. The shadow region indicates the standard deviation of mean values across different seeds. Here we report the normalized reward scores for MuJoCo tasks measured by dense rewards instead of success rate, which has been reported in previous tables.
  • Figure 5: The results on sub-datasets with different dataset sizes for MuJoCo medium-expert tasks.
  • ...and 1 more figure

Theorems & Definitions (17)

  • Proposition 1
  • Proposition 2
  • Proposition 3: Upper bound of concentrability ratio on OOD state-actions
  • Theorem 1: Upper bound of function approximated concentrability ratio
  • Theorem 2: The upper bound of performance gap
  • proof
  • proof
  • proof
  • proof
  • proof
  • ...and 7 more