Robust Offline Reinforcement Learning with Heavy-Tailed Rewards

Jin Zhu; Runzhe Wan; Zhengling Qi; Shikai Luo; Chengchun Shi

TL;DR

This work addresses robust offline reinforcement learning when rewards are heavy-tailed by introducing two frameworks based on the median-of-means (MM) method: ROAM for off-policy evaluation (OPE) and ROOM for offline policy optimization (OPO). By partitioning the data into $K$ folds and applying the MM operator to the fold-wise Q-estimates, these methods obtain robust uncertainty quantification and naturally incorporate pessimism to mitigate insufficient data coverage under heavy tails. Theoretical results give error bounds that require only finite $(1+\alpha)$-th moments, and experiments on Cartpole and MuJoCo/D4RL benchmarks show substantial improvements over standard OPE/OPO baselines and state-of-the-art methods. The approach is robust to heavy-tailed rewards in practice and provides a straightforward uncertainty quantification mechanism; code is available at the provided repository.
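The fold-and-median idea can be illustrated on plain mean estimation. The following is a minimal sketch, not the authors' implementation: the function name `median_of_means`, the strided fold split, and the toy reward list are all assumptions made for illustration, and the split presumes the samples are in no meaningful order.

```python
import statistics


def median_of_means(samples, num_folds):
    """Average each of num_folds disjoint folds, then return the median of the fold means."""
    if not 1 <= num_folds <= len(samples):
        raise ValueError("num_folds must be between 1 and len(samples)")
    fold_means = []
    for k in range(num_folds):
        fold = samples[k::num_folds]  # strided split: fold sizes differ by at most one
        fold_means.append(sum(fold) / len(fold))
    return statistics.median(fold_means)


# Toy heavy-tailed rewards: 99 ordinary observations plus one extreme outlier.
rewards = [1.0] * 99 + [1000.0]

plain_mean = sum(rewards) / len(rewards)             # pulled far above 1.0 by the outlier
robust_mean = median_of_means(rewards, num_folds=5)  # the outlier corrupts only one fold
```

Because the outlier lands in a single fold, it shifts only one of the five fold means, and the median ignores it; the plain average has no such protection.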

Abstract

This paper endeavors to augment the robustness of offline reinforcement learning (RL) in scenarios laden with heavy-tailed rewards, a prevalent circumstance in real-world applications. We propose two algorithmic frameworks, ROAM and ROOM, for robust off-policy evaluation (OPE) and offline policy optimization (OPO), respectively. Central to our frameworks is the strategic incorporation of the median-of-means method with offline RL, enabling straightforward uncertainty estimation for the value function estimator. This not only adheres to the principle of pessimism in OPO but also adeptly manages heavy-tailed rewards. Theoretical results and extensive experiments demonstrate that our two frameworks outperform existing methods when the logged dataset exhibits heavy-tailed reward distributions. The implementation of the proposal is available at https://github.com/Mamba413/ROOM.

Paper Structure (31 sections, 6 theorems, 23 equations, 11 figures, 3 tables, 5 algorithms)

INTRODUCTION
Contribution
RELATED WORKS
PRELIMINARIES
MM FOR ROBUST OFFLINE RL
MM for OPE
MM for OPO with Pessimism
THEORY
EXPERIMENTS
CONCLUSIONS AND FUTURE WORKS
THEORETICAL PROOF
Proof of Theorem \ref{thm:MIS}
Proof of Theorem \ref{thm:DM}
Proof of Theorem \ref{thm:pess-q}
ALGORITHM DETAILS
...and 16 more sections

Key Result

Proposition 1

Suppose $R_1, \ldots, R_n$ are i.i.d. with mean $\mu$ and finite $(1+\alpha)$-th moment. For any $\delta \in(0,1)$, by setting $K=\lceil 8 \log (2 / \delta)\rceil$, we have with probability at least $1-\delta$ that for some constant $C>0$.
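For orientation, median-of-means guarantees of this kind are usually stated in the following form. This is a sketch only: it assumes the form of Theorem 3 in Lugosi and Mendelson's survey (the source cited for this proposition), writes $\widehat{\mu}_{\mathrm{MM}}$ for the MM estimate of $\mu$ built from $K$ folds, and absorbs the bound on the $(1+\alpha)$-th moment into $C$. The paper's exact statement and constants may differ.

```latex
\left| \widehat{\mu}_{\mathrm{MM}} - \mu \right|
  \;\le\; C \left( \frac{\log (1/\delta)}{n} \right)^{\frac{\alpha}{1+\alpha}}
```

Under this form, $\alpha = 1$ (finite variance) gives the familiar $\sqrt{\log(1/\delta)/n}$ rate, and the rate degrades smoothly as $\alpha \to 0$.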

Figures (11)

Figure 1: Reward distributions in a two-armed bandit example. The oracle expected rewards for the two arms $a_k$ are given by $r_k$ (for $k = 1, 2$). $N(a_k)$ denotes the number of reward observations for the $k$-th arm. The expected reward estimator is given by $\widehat{r}_k$. Due to the limited sample size for the second, sub-optimal arm, its estimated expected reward $\widehat{r}_2$ suffers from a large variance. Consequently, there is a non-negligible probability that $\widehat{r}_2 > \widehat{r}_1$. By penalizing the uncertainty of reward estimation, a pessimistic estimate $\widehat{r}^L_k$ lower-bounds the reward, leading to $\widehat{r}_2^L<\widehat{r}_1^L$ and hence to the optimal action.
Figure 2: Graphical illustration for ROAM. $Q_{\mathrm{MM}}(s, a)$ is equal to $\operatorname{Median}(\{\widehat{Q}_k^\pi(s, a)\}_{k=1}^K)$.
Figure 3: (a) OPE task: log(MSE) as a function of the degrees of freedom (DF). (b) OPO task: regret as a function of the DF. In each subfigure, $\kappa$ takes the value 1.0 (left panel) and 2.0 (right panel). Error bars correspond to 95% CIs.
Figure 4: The left panel presents the results for DM methods, and the right one displays the results for MIS methods. To prevent point overlap, random noise has been added to each point on the $x$-axis.
Figure 5: Results on D4RL datasets. Each bar shows the average normalized score over the final 10 evaluations and 5 seeds. Error bars show two standard errors over the 5 seeds.
...and 6 more figures
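The aggregation in Figure 2 and the pessimism argument in Figure 1 can be sketched together on a toy example. Everything named here is hypothetical: `mm_q_value`, `pessimistic_q_value`, the lower-quantile rule, and the fold-wise numbers are illustrative assumptions, not ROOM's actual pessimism rule or any result from the paper.

```python
import statistics


def mm_q_value(fold_q_values):
    """Median across the K fold-wise estimates of Q(s, a) for one state-action pair."""
    return statistics.median(fold_q_values)


def pessimistic_q_value(fold_q_values, quantile=0.25):
    """Assumed pessimistic variant: a lower empirical quantile of the K estimates."""
    ordered = sorted(fold_q_values)
    index = int(quantile * (len(ordered) - 1))  # floor keeps the choice conservative
    return ordered[index]


# Toy fold-wise estimates for two actions in a single state (K = 5).
fold_estimates = {
    "a1": [1.0, 1.1, 0.9, 1.05, 0.95],  # well-covered action: folds agree
    "a2": [0.2, 0.4, 3.0, 0.3, 2.5],    # poorly covered action: folds disagree
}

q_avg = {a: statistics.mean(v) for a, v in fold_estimates.items()}
q_mm = {a: mm_q_value(v) for a, v in fold_estimates.items()}
q_low = {a: pessimistic_q_value(v) for a, v in fold_estimates.items()}

greedy_by_average = max(q_avg, key=q_avg.get)
greedy_by_pessimism = max(q_low, key=q_low.get)
```

In this toy setup, plain averaging prefers the poorly covered action because two inflated fold estimates dominate its mean, while both the median and the lower quantile prefer the well-covered one.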

Theorems & Definitions (10)

Definition 1: Population mean estimation via MM
Proposition 1: lugosi2019mean, Theorem 3
Theorem 1
Theorem 2
Theorem 3
proof
proof
Lemma 1: bubeck2013bandits
Lemma 2
proof
