Robust Policy Expansion for Offline-to-Online RL under Diverse Data Corruption

Longxiang He, Deheng Ye, Junbo Tan, Xueqian Wang, Li Shen

TL;DR

This work addresses robustness in Offline-to-Online Reinforcement Learning under diverse data corruption by introducing RPEX, which integrates Inverse Probability Weighting into Policy Expansion to counteract heavy-tailed policy behavior caused by corrupted data. Theoretical analysis justifies how IPW mitigates tail risks and enhances exploration, while extensive experiments on D4RL demonstrate state-of-the-art O2O performance across corruption types. Key contributions include the first robust O2O method for joint offline-online corruption, a principled IPW-based framework, and comprehensive ablations highlighting practical considerations such as normalization and policy extraction. The results have practical impact for deploying O2O RL in real-world, noisy environments where data integrity cannot be guaranteed.

Abstract

Pretraining a policy on offline data followed by fine-tuning through online interactions, known as Offline-to-Online Reinforcement Learning (O2O RL), has emerged as a promising paradigm for real-world RL deployment. However, both offline datasets and online interactions in practical environments are often noisy or even maliciously corrupted, severely degrading the performance of O2O RL. Existing works primarily focus on mitigating the conservatism of offline policies via online exploration, while the robustness of O2O RL under data corruption, including corruption of states, actions, rewards, and dynamics, remains unexplored. In this work, we observe that data corruption induces heavy-tailed behavior in the policy, thereby substantially degrading the efficiency of online exploration. To address this issue, we incorporate Inverse Probability Weighting (IPW) into the online exploration policy to alleviate heavy-tailedness, and propose a novel, simple yet effective method termed RPEX: Robust Policy EXpansion. Extensive experimental results on D4RL datasets demonstrate that RPEX achieves SOTA O2O performance across a wide range of data corruption scenarios. Code is available at https://github.com/felix-thu/RPEX.

Paper Structure

This paper contains 27 sections, 1 theorem, 19 equations, 9 figures, 6 tables, 1 algorithm.

Key Result

Proposition 6.1

Given $P_{\mathbf{w}}({\bm{a}}_i|{\bm{a}}_1,{\bm{a}}_2)$, where ${\bm{a}}_1\sim\pi_\beta$ and ${\bm{a}}_2\sim\pi_\theta$, Eq. (rpex) maximizes the following objective (see the proof in Appendix derivation_rpex).
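To make the role of $P_{\mathbf{w}}$ more concrete, the sketch below shows a Policy-Expansion-style selection between two candidate actions (one from $\pi_\beta$, one from $\pi_\theta$) as a softmax over their Q-values, followed by a hypothetical inverse-probability-weighted variant. The exact RPEX weighting is not given in this summary, so `ipw_pex_probs`, its `log_probs` argument, and the temperature `alpha` are illustrative assumptions rather than the authors' formulation.

```python
import numpy as np

def pex_probs(q_values, alpha=1.0):
    """Softmax over candidate Q-values: P_w(a_i) is proportional to exp(Q(s, a_i) / alpha)."""
    logits = np.asarray(q_values, dtype=float) / alpha
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def ipw_pex_probs(q_values, log_probs, alpha=1.0):
    """Hypothetical IPW variant (an assumption, not the paper's equation):
    each softmax weight is divided by the likelihood of the candidate under
    the policy that proposed it, so over-represented actions are down-weighted."""
    logits = np.asarray(q_values, dtype=float) / alpha
    logits = logits - np.asarray(log_probs, dtype=float)  # divide by p(a_i) in log space
    logits = logits - logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()

if __name__ == "__main__":
    q = [1.0, 1.0]                        # equal values for a_1 ~ pi_beta and a_2 ~ pi_theta
    logp = [np.log(0.9), np.log(0.1)]     # a_1 is far more likely under its proposal policy
    print("plain selection:", pex_probs(q))
    print("IPW selection:  ", ipw_pex_probs(q, logp))
```

With equal Q-values, the plain selection is uniform, while the inverse-probability-weighted version shifts mass toward the less likely candidate, which is the kind of exploration effect the TL;DR attributes to IPW.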

Figures (9)

  • Figure 1: (a) Problem statement. A schematic illustration of the O2O attack, in which both the offline pre-training phase and the online fine-tuning phase are targeted. (b) Kurtosis values [garg2021proximal, mardia1970measures] of the policies. CLEAN denotes IQL trained without attacks, whereas RIQL and IQL are trained on the attacked datasets.
  • Figure 2: We study the impact of policy heavy-tailedness in the grid-world domain. An offline policy is trained using the dataset shown in Figure 2(a) and is then used to collect trajectories during the online exploration phase under both corrupted and uncorrupted settings. In Figures 2(b)–(e), the opacity of the green arrows indicates the selection probability. Red arrows denote the most probable trajectory generated by IQL or IQL+IPW under the respective conditions. Specifically, panel (a) illustrates the dataset transitions; panels (b) and (d) show trajectories selected by IQL under clean and corrupted value functions, respectively; panels (c) and (e) show trajectories selected by IQL+IPW under clean and corrupted value functions, respectively.
  • Figure 3: Action distributions generated by the offline pretrained policy under reward attack on the Halfcheetah-MR task. (a) Action distributions of RIQL under attack. (b) Action distributions of RIQL+IPW under attack. (c) Action distributions of IQL without attack. (d) Comparison of RPEX (with IPW) against RIQL-PEX (without IPW) and RIQL (Vanilla RIQL).
  • Figure 4: The effect of the update-to-data (UTD) ratio on O2O methods. The results are averaged over 5 random seeds on the Medium Replay tasks.
  • Figure 5: Ablation study of the main components of RPEX on the Hopper MR task. From left to right: policy type ablation, state normalization ablation, and hyperparameter $\kappa$ ablation.
  • ...and 4 more figures
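
Figure 1(b) uses kurtosis as an indicator of policy heavy-tailedness. As a rough illustration, and assuming the cited measure is Mardia's multivariate kurtosis, the sketch below estimates it from a batch of sampled actions. The function name `mardia_kurtosis` and the synthetic action samples are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def mardia_kurtosis(samples):
    """Mardia's multivariate kurtosis: the mean of squared Mahalanobis
    distances raised to the second power. For a p-dimensional Gaussian
    the population value is p * (p + 2); larger values suggest heavier tails."""
    x = np.asarray(samples, dtype=float)
    n, _ = x.shape
    centered = x - x.mean(axis=0)
    cov = centered.T @ centered / n          # biased (maximum-likelihood) covariance
    cov_inv = np.linalg.pinv(cov)            # pseudo-inverse guards against singular covariances
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return float(np.mean(d2 ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gaussian_actions = rng.standard_normal(size=(20000, 3))   # light-tailed reference
    heavy_actions = rng.standard_t(df=5, size=(20000, 3))     # heavy-tailed stand-in
    print("Gaussian reference p(p+2):", 3 * (3 + 2))
    print("Gaussian sample estimate: ", mardia_kurtosis(gaussian_actions))
    print("Student-t sample estimate:", mardia_kurtosis(heavy_actions))
```

Comparing the estimate against the Gaussian reference $p(p+2)$ gives a simple check of whether a policy's action distribution is heavier-tailed than a Gaussian of the same dimension.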

Theorems & Definitions (1)

  • Proposition 6.1