On the Tension Between Optimality and Adversarial Robustness in Policy Optimization

Haoran Li, Jiayu Lv, Congying Han, Zicheng Zhang, Anqi Li, Yan Liu, Tiande Guo, Nan Jiang

TL;DR

The paper investigates the tension between achieving optimality and adversarial robustness in policy optimization. It analyzes SPO and ARPO within the ISA-MDP framework, revealing that ARPO improves robustness at the cost of natural return due to adversarial landscape reshaping, while SPO can underperform under attack. To reconcile these effects, it proposes BARPO, a bilevel framework that modulates adversary strength and leverages a KL-surrogate to maintain navigability and preserve global optima. Empirical results on MuJoCo tasks show BARPO consistently outperforms vanilla ARPO and, with SPO guidance, achieves strong natural and robust performance, suggesting a practical route toward aligning theory and practice in robust RL.

Abstract

Optimality and adversarial robustness in deep reinforcement learning have long been regarded as conflicting goals. Nonetheless, recent theoretical insights presented in CAR suggest a potential alignment, raising the important question of how to realize this in practice. This paper first identifies a key gap between theory and practice by comparing standard policy optimization (SPO) and adversarially robust policy optimization (ARPO). Although they share theoretical consistency, a fundamental tension between robustness and optimality arises in practical policy gradient methods. SPO tends to converge to vulnerable first-order stationary policies (FOSPs) with strong natural performance, whereas ARPO typically favors more robust FOSPs at the expense of reduced returns. Furthermore, we attribute this tradeoff to the reshaping effect of the strongest adversary in ARPO, which significantly complicates the global landscape by inducing deceptive sticky FOSPs. This improves robustness but makes navigation more challenging. To alleviate this, we develop BARPO, a bilevel framework that unifies SPO and ARPO by modulating adversary strength, thereby facilitating navigability while preserving global optima. Extensive empirical results demonstrate that BARPO consistently outperforms vanilla ARPO, providing a practical approach to reconcile theoretical and empirical performance.

Paper Structure

This paper contains 69 sections, 16 theorems, 130 equations, 7 figures, 12 tables, 3 algorithms.

Key Result

Theorem 3.1

Given a policy $\pi_\theta$, consider the direct parameterization of the adversary $\nu_\vartheta:\mathcal{S} \rightarrow \mathcal{S},\ s\mapsto s + \vartheta_s \in B(s)$ for every state $s\in \mathcal{S}$. Then, for any state $s_i\in \mathcal{S}$, we have the state-wise policy gradient […], where $d^{\pi \circ \nu}$ is the state-action visitation distribution under $\pi\circ\nu$, and …
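To make the direct parameterization in Theorem 3.1 concrete, the following is a minimal sketch of an adversary that stores one offset $\vartheta_s$ per state and keeps $s + \vartheta_s$ inside $B(s)$. It rests on assumptions not stated in the text above: a finite, indexed set of states, $B(s)$ taken to be an $\ell_\infty$ ball of radius `eps` around $s$, and a gradient supplied by the caller (the theorem's gradient expression is not reproduced here). The names `DirectAdversary`, `perturb`, and `projected_step` are hypothetical.

```python
import numpy as np


class DirectAdversary:
    """Directly parameterized adversary nu: s -> s + theta_s.

    Assumption: B(s) = {s' : ||s' - s||_inf <= eps}. The summary above
    does not specify the shape of B(s); this is only one common choice.
    """

    def __init__(self, num_states, state_dim, eps):
        self.eps = eps
        # One perturbation vector theta_s per state, initialised to zero
        # (the zero offset is the identity map, i.e. no attack).
        self.theta = np.zeros((num_states, state_dim))

    def perturb(self, state_index, state):
        """Return the perturbed observation s + theta_s."""
        return state + self.theta[state_index]

    def projected_step(self, state_index, grad, lr):
        """Update theta_{s_i} for one state, then project back into B(s).

        `grad` is a caller-supplied gradient of the policy's return with
        respect to theta_{s_i}. The adversary lowers the return, so it
        steps against the gradient (hypothetical update rule).
        """
        updated = self.theta[state_index] - lr * grad
        # Projection onto the l_inf ball is element-wise clipping.
        self.theta[state_index] = np.clip(updated, -self.eps, self.eps)


adversary = DirectAdversary(num_states=3, state_dim=2, eps=0.1)
state = np.array([1.0, -2.0])
adversary.projected_step(0, grad=np.array([5.0, -5.0]), lr=1.0)
perturbed = adversary.perturb(0, state)
```

Because each state has its own offset, an update at state index 0 leaves the perturbation at every other state unchanged, which is what makes a state-wise gradient natural for this parameterization.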

Figures (7)

  • Figure 1: Schematic illustration of the optimization landscapes under SPO, ARPO, and BARPO. (a) SPO ascends along fragile directions, leading to vulnerable FOSPs with high natural value. (b) ARPO becomes trapped in robust regions but is limited to low-return solutions. (c) BARPO reshapes the landscape by lifting robust but low-return regions, enabling convergence to robust FOSPs with high returns. (d) Overall comparison of the three paradigms: contour lines represent natural returns, while background color indicates robustness, with darker red denoting lower robustness.
  • Figure 2: Intuitive examples of reshaping.
  • Figure 3: Adversarially robust value geometry.
  • Figure 4: Illustrative Optimization Landscapes
  • Figure 5: Natural and robust performance of SPO, ARPO, BARPO without guidance (w/o g), and BARPO on four continuous control tasks in MuJoCo. BARPO (w/o g) consistently outperforms ARPO.
  • ...and 2 more figures

Theorems & Definitions (31)

  • Theorem 3.1: Policy Gradient for Adversary
  • Theorem 3.2: Convergence of ARPO
  • Proposition 3.1
  • Proposition 3.2: Vulnerable Connectivity Leading to Separated FOSPs
  • Theorem 4.1: Surrogate Adversary
  • Theorem C.1: Policy Gradient for Adversary
  • Remark C.1
  • proof
  • Definition C.1: $\delta$-Approximation Adversary
  • Lemma C.1: Lipschitz of Strongest Adversary
  • ...and 21 more