Decoupled Prioritized Resampling for Offline RL

Yang Yue, Bingyi Kang, Xiao Ma, Qisen Yang, Gao Huang, Shiji Song, Shuicheng Yan

TL;DR

This work tackles distributional shift in offline reinforcement learning by decoupling data resampling from policy evaluation and introducing prioritized resampling that favors higher-quality actions. The proposed Offline Decoupled Prioritized Resampling (ODPR) framework, with ODPR-A (advantage-based) and ODPR-R (return-based) variants, reshapes the behavior policy used for policy improvement and policy constraint, while policy evaluation continues to use uniform sampling. The authors show theoretically that prioritized resampling induces an improved behavior policy, so that a policy constrained to it is likely to perform better, and demonstrate empirically that fine-grained, trajectory-sensitive priorities yield consistent gains across multiple baselines on the D4RL suite, including robustness to noisy value estimates. ODPR is a plug-and-play component: it improves existing offline RL methods without requiring any interaction with the environment, which is most useful on mixed datasets that contain a large share of suboptimal data.
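The decoupling described above can be sketched as two independent index samplers over the same dataset. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `decoupled_batches`, the precomputed `weights` array, and the batch size are all hypothetical.

```python
import numpy as np

def decoupled_batches(weights, batch_size, rng):
    """Draw one uniform and one prioritized batch of transition indices.

    `weights` holds one non-negative priority per transition (assumed to be
    precomputed, e.g. from advantages or trajectory returns).
    """
    weights = np.asarray(weights, dtype=np.float64)
    n = len(weights)
    # Policy evaluation (critic update): sample uniformly over the dataset.
    eval_idx = rng.integers(0, n, size=batch_size)
    # Policy improvement and constraint (actor update): sample in proportion
    # to the priority weights.
    probs = weights / weights.sum()
    improve_idx = rng.choice(n, size=batch_size, p=probs)
    return eval_idx, improve_idx

# Degenerate weights (all mass on transition 2) make the effect easy to see.
rng = np.random.default_rng(0)
weights = np.array([0.0, 0.0, 1.0, 0.0])
eval_idx, improve_idx = decoupled_batches(weights, batch_size=8, rng=rng)
```

The critic batch is drawn without regard to the priorities, so value estimation is not biased toward the reweighted data, while the actor batch follows the prioritized behavior policy.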

Abstract

Offline reinforcement learning (RL) is challenged by the distributional shift problem. To address this problem, existing works mainly focus on designing sophisticated policy constraints between the learned policy and the behavior policy. However, these constraints are applied equally to well-performing and inferior actions through uniform sampling, which might negatively affect the learned policy. To alleviate this issue, we propose Offline Prioritized Experience Replay (OPER), featuring a class of priority functions designed to prioritize highly-rewarding transitions, making them more frequently visited during training. Through theoretical analysis, we show that this class of priority functions induces an improved behavior policy, and when constrained to this improved policy, a policy-constrained offline RL algorithm is likely to yield a better solution. We develop two practical strategies to obtain priority weights: estimating advantages based on a fitted value network (OPER-A), or utilizing trajectory returns (OPER-R) for quick computation. OPER is a plug-and-play component for offline RL algorithms. As case studies, we evaluate OPER on five different algorithms: BC, TD3+BC, Onestep RL, CQL, and IQL. Extensive experiments demonstrate that both OPER-A and OPER-R significantly improve the performance of all baseline methods. Code and priority weights are available at https://github.com/sail-sg/OPER.
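As a rough illustration of the return-based strategy, the sketch below gives every transition a weight derived from the undiscounted return of the trajectory it belongs to. The min-max scaling, the `base` offset, and the function name `return_priorities` are assumptions made for this example; the abstract does not specify the exact priority function.

```python
import numpy as np

def return_priorities(rewards, dones, base=0.1):
    """Weight each transition by the scaled return of its trajectory.

    `rewards` and `dones` are flat per-transition arrays; a True entry in
    `dones` marks the last transition of a trajectory. `base` (hypothetical)
    keeps the lowest-return trajectory from receiving zero weight.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=bool)
    # Trajectory id per transition: the id increases after each terminal flag.
    traj_ids = np.concatenate(([0], np.cumsum(dones)[:-1]))
    # Undiscounted return of each trajectory.
    returns = np.bincount(traj_ids, weights=rewards)
    span = returns.max() - returns.min()
    # Min-max scale returns to [0, 1]; fall back to zeros if all are equal.
    scaled = (returns - returns.min()) / span if span > 0 else np.zeros_like(returns)
    # Broadcast trajectory-level weights back to transitions.
    return scaled[traj_ids] + base

# Three trajectories with returns 3, 0, and 5.
rewards = [1, 1, 1, 0, 0, 5]
dones = [0, 0, 1, 0, 1, 1]
weights = return_priorities(rewards, dones)
probs = weights / weights.sum()  # sampling probabilities per transition
```

Because the weights depend only on logged rewards and episode boundaries, they can be computed once before training, which is consistent with the "quick computation" the abstract attributes to the return-based variant.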

Paper Structure (37 sections, 2 theorems, 11 equations, 9 figures, 17 tables, 2 algorithms)

This paper contains 37 sections, 2 theorems, 11 equations, 9 figures, 17 tables, 2 algorithms.

Key Result

Lemma 2.1

(Performance Difference Lemma, kakade2002approximately.) For any policies $\pi$ and $\beta$, where $d_\pi({\boldsymbol{s}}) = \sum_{t=0}^\infty \gamma^t p({\boldsymbol{s}}_t = {\boldsymbol{s}} | \pi)$ represents the unnormalized discounted state marginal distribution induced by the policy $\pi$, and $p({\boldsymbol{s}}_t = {\boldsymbol{s}} | \pi)$ is the probability of the state …
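
For orientation, the performance difference lemma is commonly stated in the form below when $d_\pi$ is the unnormalized discounted state marginal. The symbols $J(\cdot)$ (expected discounted return) and $A^\beta$ (advantage function of $\beta$) are notation assumed here, and the paper's exact statement may differ.

$$
J(\pi) - J(\beta) = \int_{{\boldsymbol{s}}} d_\pi({\boldsymbol{s}}) \int_{{\boldsymbol{a}}} \pi({\boldsymbol{a}} | {\boldsymbol{s}}) \, A^\beta({\boldsymbol{s}}, {\boldsymbol{a}}) \, d{\boldsymbol{a}} \, d{\boldsymbol{s}}
$$

Read this way, the lemma says that $\pi$ outperforms $\beta$ whenever, on the states $\pi$ visits, it places its probability mass on actions with positive advantage under $\beta$.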

Figures (9)

  • Figure 1: (a) Prioritized resampling. Given a state, possible actions are ranked by quality along the x-axis. A behavior policy (in blue) usually follows a multi-modal distribution. A prioritized policy (in red) is obtained by prioritized resampling, which assigns higher weights to better actions. (b) Offline RL (i) vs. offline RL with ODPR (ii). Unlike vanilla offline RL, ODPR obtains a sequence of progressively better behavior policies by iteratively resampling the current dataset. Policy-constrained offline RL algorithms are then run on the resulting dataset $\mathcal{D}^K$. (c) The average score of popular offline RL algorithms. ODPR statistically boosts the performance of these algorithms on the standard D4RL benchmark and on curated mixed datasets abundant with suboptimal data. Further, ODPR-A is effective on datasets without trajectory information, where preceding trajectory-based resampling strategies hong2023harnessing, chen2021decision fail.
  • Figure 2: A subset of tasks pertinent to real-world applications from the D4RL benchmark fu2020d4rl, including MuJoCo locomotion tasks with bipeds or quadrupeds, maze navigation tasks with an 8-DoF Ant quadruped robot, Kitchen tasks with a 9-DoF Franka robot, and Adroit tasks with a 24-DoF Hand robot.
  • Figure 3: Visualization of the effect of ODPR on prioritized behavior policies in the bandit setting. As the iterations progress, the suboptimal red, green, and blue samples noticeably decrease, leading to the prioritized dataset progressively converging towards the optimal action mode, represented by the yellow color. The value enclosed in parentheses denotes the average reward.
  • Figure 4: The left figure shows TD3+BC learning on the original dataset, where it fails to find the optimal action. The middle and right figures depict the policies learned by running TD3+BC on the first and fifth prioritized datasets, respectively. These policies converge towards nearly optimal and optimal modes, respectively.
  • Figure 5: Trajectory return distributions of hopper-medium-replay (left) and hopper-medium-expert (right). Medium-replay datasets usually have a long-tailed distribution, while medium-expert datasets often display two peaks. Both contain data collected by policies of varying quality.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Lemma 2.1
  • Theorem 2.2