Multi-level Certified Defense Against Poisoning Attacks in Offline Reinforcement Learning
Shijie Liu, Andrew C. Cullen, Paul Montague, Sarah Erfani, Benjamin I. P. Rubinstein
TL;DR
This paper tackles poisoning attacks in offline reinforcement learning by introducing MuCD, a multi-level certified defense that leverages differential privacy (DP) to provide robustness guarantees for both per-state actions and the overall expected cumulative reward. The framework covers both trajectory-level and transition-level poisoning and delivers action-level and policy-level certifications through two DP-based components: a randomized training process built on DP, and post-processing-consistent certification bounds derived from approximate DP (ADP) and Rényi DP. Empirically, MuCD outperforms prior approaches (notably COPA), achieving larger certified radii and tolerating higher poisoning fractions (up to around $7\%$) while retaining a substantial portion of the original performance, across discrete and continuous action spaces and across stochastic and deterministic environments. The results highlight the practical potential of DP-driven certified defenses to improve safety and reliability in offline RL deployments.
Abstract
Similar to other machine learning frameworks, Offline Reinforcement Learning (RL) has been shown to be vulnerable to poisoning attacks because of its reliance on externally sourced datasets, a vulnerability that is exacerbated by its sequential nature. To mitigate the risks posed by RL poisoning, we extend certified defenses to provide larger guarantees against adversarial manipulation, ensuring robustness for both per-state actions and the overall expected cumulative reward. Our approach leverages properties of Differential Privacy in a manner that allows this work to span both continuous and discrete spaces, as well as stochastic and deterministic environments -- significantly expanding the scope and applicability of achievable guarantees. Empirical evaluations demonstrate that our approach limits the performance drop to no more than $50\%$ with up to $7\%$ of the training data poisoned, significantly improving over the $0.008\%$ in prior work~\citep{wu_copa_2022}, while producing certified radii that are $5$ times larger. This highlights the potential of our framework to enhance safety and reliability in offline RL.
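To make the idea of a DP-based action-level certificate concrete, the sketch below applies the textbook $(\epsilon, \delta)$-DP inequality, $\Pr[M(D) \in S] \le e^{\epsilon} \Pr[M(D') \in S] + \delta$, together with standard group privacy, to check whether the most likely action at a state provably stays on top when up to $r$ training items are poisoned. It is an illustration under stated assumptions rather than the certification procedure of the paper: the function names, the inputs `p_top_lower` and `p_runner_upper` (assumed to be valid probability bounds for the top and runner-up actions under the clean randomized training process), and the use of plain group privacy in place of the paper's ADP and Rényi-DP bounds are all hypothetical.

```python
import math


def group_privacy(eps, delta, r):
    """Standard group-privacy composition for datasets differing in r items.

    (eps, delta)-DP for one changed item implies
    (r * eps, delta * sum_{i<r} e^{i * eps})-DP for r changed items.
    """
    eps_r = r * eps
    delta_r = delta * sum(math.exp(i * eps) for i in range(r))
    return eps_r, delta_r


def certified_action(p_top_lower, p_runner_upper, eps, delta, r=1):
    """Return True if the top action provably remains the most likely one
    when up to r training items are poisoned (hypothetical helper)."""
    eps_r, delta_r = group_privacy(eps, delta, r)
    # Lower bound on the top action's probability after poisoning.
    top_after = (p_top_lower - delta_r) / math.exp(eps_r)
    # Upper bound on the runner-up action's probability after poisoning.
    runner_after = math.exp(eps_r) * p_runner_upper + delta_r
    return top_after > runner_after


def certified_radius(p_top_lower, p_runner_upper, eps, delta, max_r=1000):
    """Largest r (up to max_r) for which the action is certified; 0 if none."""
    radius = 0
    for r in range(1, max_r + 1):
        if not certified_action(p_top_lower, p_runner_upper, eps, delta, r):
            break
        radius = r
    return radius
```

Calling `certified_radius(0.9, 0.05, eps=0.1, delta=0.0)` returns the number of poisoned items this simple bound can tolerate for that state. Tighter accounting, such as the Rényi-DP route mentioned above, would generally yield larger radii than this sketch.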
