CAMP in the Odyssey: Provably Robust Reinforcement Learning with Certified Radius Maximization

Derui Wang, Kristen Moore, Diksha Goel, Minjune Kim, Gang Li, Yang Li, Robin Doss, Minhui Xue, Bo Li, Seyit Camtepe, Liming Zhu

TL;DR

CAMP addresses the trade-off between robustness certification and policy utility in robust DRL. It introduces a differentiable surrogate for radius maximization and couples it with a novel policy imitation framework to stabilize training, enabling the learned DQN to achieve higher certified returns at larger radii. The approach yields substantial improvements in both certification and empirical robustness across control and driving-like tasks, while remaining scalable to high-dimensional observations. The work advances provable robustness in DRL by integrating radius-aware learning directly into the training objective, with practical impact for deploying certifiably robust agents in noisy or adversarial environments.

Abstract

Deep reinforcement learning (DRL) has gained widespread adoption in control and decision-making tasks due to its strong performance in dynamic environments. However, DRL agents are vulnerable to noisy observations and adversarial attacks, and concerns about the adversarial robustness of DRL systems have emerged. Recent efforts have focused on addressing these robustness issues by establishing rigorous theoretical guarantees for the returns achieved by DRL agents in adversarial settings. Among these approaches, policy smoothing has proven to be an effective and scalable method for certifying the robustness of DRL agents. Nevertheless, existing certifiably robust DRL relies on policies trained with simple Gaussian augmentations, resulting in a suboptimal trade-off between certified robustness and certified return. To address this issue, we introduce a novel paradigm dubbed Certified-rAdius-Maximizing Policy (CAMP) training. CAMP is designed to enhance DRL policies, achieving better utility without compromising provable robustness. By leveraging the insight that the global certified radius can be derived from local certified radii based on training-time statistics, CAMP formulates a surrogate loss related to the local certified radius and optimizes the policy guided by this surrogate loss. We also introduce policy imitation as a novel technique to stabilize CAMP training. Experimental results demonstrate that CAMP significantly improves the robustness-return trade-off across various tasks. Based on the results, CAMP can achieve up to twice the certified expected return compared to that of baselines. Our code is available at https://github.com/NeuralSec/camp-robust-rl.
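
The policy smoothing idea mentioned in the abstract can be illustrated with a minimal NumPy sketch: the agent acts greedily on a Gaussian-noised copy of each observation rather than on the raw observation. This is only an illustration under stated assumptions, not the authors' implementation; `smoothed_action`, `toy_q_fn`, the two-action linear Q-function, and the noise scale `sigma=0.2` are all hypothetical names and values chosen for the example.

```python
import numpy as np

def smoothed_action(q_fn, obs, sigma, rng):
    """Pick the greedy action on a Gaussian-noised copy of the observation."""
    noisy_obs = obs + rng.normal(0.0, sigma, size=obs.shape)
    return int(np.argmax(q_fn(noisy_obs)))

def toy_q_fn(obs):
    # Hypothetical linear Q-function with two actions, for illustration only.
    weights = np.array([[1.0, -1.0], [-1.0, 1.0]])
    return weights @ obs

rng = np.random.default_rng(0)
obs = np.array([0.5, -0.5])

# Sample the randomized policy many times at the same observation and
# record how often each action is chosen.
actions = [smoothed_action(toy_q_fn, obs, sigma=0.2, rng=rng) for _ in range(1000)]
action_freq = np.bincount(actions, minlength=2) / len(actions)
```

In this toy setting the clean Q-values are well separated, so the noised policy selects the same action in almost every draw. When the top two Q-values are close, the same noise flips the chosen action more often, which is one way to see why training for a larger Q-value gap could help a smoothed policy.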

Paper Structure

This paper contains 24 sections, 10 theorems, 37 equations, 8 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

Given a target expected return threshold $\xi$ for the randomized policy, let the perturbed trajectory be $\mathbf{z}' = \mathbf{z} + \Delta$ and define $P_{\pi}^{\mathbf{z}}(C) := \Pr[F_{\pi}(\mathbf{z}) \geq C]$. Let $\mathbf{R} = \{\mathbf{r}_1, \ldots, \mathbf{r}_m\}$ represent a set of [...] then $\mathbb{E}[F_{\pi}(\mathbf{z}')] \geq \xi$.
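
The quantity $P_{\pi}^{\mathbf{z}}(C)$ defined in the theorem is the probability that the return reaches at least $C$. A minimal sketch of a plug-in Monte Carlo estimate of that tail probability is shown here, assuming returns have already been collected from repeated rollouts of the randomized policy. The function name `tail_probability` and the numbers in `sampled_returns` are hypothetical, and this plain empirical frequency is not a certified bound.

```python
import numpy as np

def tail_probability(returns, threshold):
    """Empirical estimate of Pr[F >= threshold] from sampled returns."""
    returns = np.asarray(returns, dtype=float)
    return float(np.mean(returns >= threshold))

# Hypothetical returns from five rollouts, for illustration only.
sampled_returns = [200.0, 180.0, 150.0, 200.0, 90.0]

# Four of the five sampled returns are at least 150.
p_150 = tail_probability(sampled_returns, 150.0)
```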

Figures (8)

  • Figure 1: An overview of CAMP. During training, the agents interact with the environments with observation noise and receive rewards. A reference policy has a reference Q-network learning through a vanilla temporal-difference loss while the Q-network of a primary policy is optimized by minimizing the CAMP loss to increase the gap between the top-1 and runner-up Q-values. The primary network also mimics the action predicted by the reference network during training with the imitation loss. The trained primary policy can then be certified by policy smoothing with better certified expected return at each certified radius.
  • Figure 2: Certification results on CartPole, Highway, Pong, Freeway, and Bank Heist. The perturbation budgets for Atari games (Freeway, Pong, and Bank Heist) are normalized by dividing by $255$.
  • Figure 3: Empirical robustness of agents against PGD in CartPole, Highway, Pong, Freeway, and Bank Heist. We use the same attack as in policy smoothing (PS; Kumar et al., 2021) to evaluate the robustness of the agent in individual runs. The perturbation budgets for Freeway, Pong, and Bank Heist are normalized by dividing by $255$. In Pong, since agents either win or lose in each run, the expected return corresponds to the win rate.
  • Figure 4: Empirical robustness of agents against APGD in CartPole, Highway, Pong, Freeway, and Bank Heist. The average return values are evaluated under the same settings as PGD. APGD preserves the perturbation budget at each step, allowing it to perturb observations across more steps, which can result in more significant performance degradation for the agents.
  • Figure 5: Ablation on $\lambda$ values in Cartpole-1. The certified expected returns obtained from various $\lambda$ values are shown in the figures. Each figure presents results based on a fixed smoothing noise scale applied to the observed states. From left to right, the smoothing noise scales are 0.2, 0.4, 0.6, 0.8, and 1.0, respectively.
  • ...and 3 more figures
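
The Figure 1 caption describes two training signals for the primary Q-network: a loss that widens the gap between the top-1 and runner-up Q-values, and an imitation loss toward the reference network's action. The NumPy sketch below illustrates that general shape only. It assumes a hinge-style margin penalty and a softmax cross-entropy imitation term, which are stand-ins and not the paper's CAMP loss; `target_gap`, `imitation_weight`, and all function names are hypothetical.

```python
import numpy as np

def top2_gap(q_values):
    """Gap between the largest and second-largest Q-value in each row."""
    sorted_q = np.sort(q_values, axis=1)
    return sorted_q[:, -1] - sorted_q[:, -2]

def margin_loss(q_values, target_gap):
    """Hinge penalty that is zero once the top-1/runner-up gap reaches target_gap."""
    return float(np.mean(np.maximum(0.0, target_gap - top2_gap(q_values))))

def imitation_loss(primary_q, reference_q):
    """Cross-entropy between softmax(primary_q) and the reference's greedy action."""
    shifted = primary_q - primary_q.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ref_actions = reference_q.argmax(axis=1)
    return float(-np.mean(log_probs[np.arange(len(ref_actions)), ref_actions]))

def combined_loss(primary_q, reference_q, target_gap=1.0, imitation_weight=0.5):
    """Weighted sum of the margin and imitation terms (weights are assumptions)."""
    return (margin_loss(primary_q, target_gap)
            + imitation_weight * imitation_loss(primary_q, reference_q))

# Hypothetical Q-values for a batch of two states and three actions.
primary_q = np.array([[2.0, 0.5, 0.0], [1.0, 0.9, 0.2]])
reference_q = np.array([[1.5, 0.2, 0.1], [0.3, 0.8, 0.1]])

gaps = top2_gap(primary_q)
m_loss = margin_loss(primary_q, target_gap=1.0)
total = combined_loss(primary_q, reference_q)
```

In this toy batch the first state already exceeds the target gap and contributes nothing to the margin term, while the second state has a small gap and also disagrees with the reference action, so both terms penalize it.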

Theorems & Definitions (16)

  • Theorem 1: Change of variable
  • Theorem 2: Correlation between CDF and expectation
  • Lemma 1: Lipschitz continuity of smoothed return function
  • Theorem 3: Soft certified radius
  • proof
  • Theorem 4: Local certified radius
  • proof: Proof sketch
  • Theorem 4: Change of variable
  • proof
  • Theorem 4: Correlation between CDF and expectation
  • ...and 6 more