On Robust Reinforcement Learning with Lipschitz-Bounded Policy Networks

Nicholas H. Barbara; Ruigang Wang; Ian R. Manchester

TL;DR

The paper studies the robustness of policy networks in deep reinforcement learning when a Lipschitz bound is enforced on the policy, i.e., $\mathrm{Lip}(\kappa) \le \gamma$, which limits how much the policy's output can change under input perturbations. It compares several Lipschitz parameterizations, including spectral normalization (SN), almost-orthogonal layers (AOL), Cayley transforms, and the more expressive Sandwich layer, on pendulum swing-up and Atari Pong. Smaller $\gamma$ improves robustness to disturbances and adversarial attacks, and the Sandwich layer achieves strong robustness without sacrificing nominal performance, outperforming SN and AOL at similar bounds. The results highlight the importance of layer design in robust RL; suggested future directions include combining Lipschitz-bounded architectures with adversarial training and deploying them on real robotic systems.
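
For reference, the bound $\mathrm{Lip}(\kappa) \le \gamma$ can be written out as follows. This is the standard definition of a global Lipschitz bound; the choice of the $\ell_2$ norm and the symbols $x$, $y$, and $\delta$ are assumptions made here for illustration and are not taken from the summary above.

$$
\|\kappa(x) - \kappa(y)\|_2 \le \gamma \, \|x - y\|_2 \quad \text{for all inputs } x, y .
$$

Setting $y = x + \delta$ shows why a smaller $\gamma$ limits sensitivity: a perturbation $\delta$ of the observation can change the action by at most $\gamma \|\delta\|_2$,

$$
\|\kappa(x + \delta) - \kappa(x)\|_2 \le \gamma \, \|\delta\|_2 .
$$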

Abstract

This paper presents a study of robust policy networks in deep reinforcement learning. We investigate the benefits of policy parameterizations that naturally satisfy constraints on their Lipschitz bound, analyzing their empirical performance and robustness on two representative problems: pendulum swing-up and Atari Pong. We illustrate that policy networks with smaller Lipschitz bounds are more robust to disturbances, random noise, and targeted adversarial attacks than unconstrained policies composed of vanilla multi-layer perceptrons or convolutional neural networks. However, the structure of the Lipschitz layer is important. We find that the widely-used method of spectral normalization is too conservative and severely impacts clean performance, whereas more expressive Lipschitz layers such as the recently-proposed Sandwich layer can achieve improved robustness without sacrificing clean performance.
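
To make the idea of a Lipschitz-bounded policy concrete, the following is a minimal NumPy sketch of the simplest parameterization mentioned above, spectral normalization. It is not the authors' implementation and it is not the Sandwich layer. The layer sizes, the even split of $\gamma$ across layers, and the helper names `spectral_normalize`, `make_policy`, and `empirical_lipschitz` are all hypothetical choices made for illustration.

```python
import numpy as np


def spectral_normalize(W, layer_bound):
    """Rescale W so that its spectral norm equals layer_bound."""
    sigma = np.linalg.norm(W, 2)  # largest singular value of W
    return W * (layer_bound / max(sigma, 1e-12))


def make_policy(weights, biases, gamma):
    """Build a ReLU MLP whose l2 Lipschitz constant is at most gamma.

    ReLU is 1-Lipschitz, so the product of the layers' spectral norms
    upper-bounds the Lipschitz constant of the whole network.
    Assumption: gamma is split evenly across layers.
    """
    n_layers = len(weights)
    layer_bound = gamma ** (1.0 / n_layers)
    normed = [spectral_normalize(W, layer_bound) for W in weights]

    def policy(x):
        h = x
        for i, (W, b) in enumerate(zip(normed, biases)):
            h = W @ h + b
            if i < n_layers - 1:  # no activation on the output layer
                h = np.maximum(h, 0.0)
        return h

    return policy


def empirical_lipschitz(policy, dim, n_pairs=1000, seed=0):
    """Estimate a lower bound on Lip(policy) from random nearby input pairs."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_pairs):
        x = rng.normal(size=dim)
        y = x + 1e-2 * rng.normal(size=dim)
        ratio = np.linalg.norm(policy(x) - policy(y)) / np.linalg.norm(x - y)
        best = max(best, ratio)
    return best


# Hypothetical sizes: 2 inputs, two hidden layers of 16 units, 1 action.
rng = np.random.default_rng(0)
sizes = [2, 16, 16, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

policy = make_policy(weights, biases, gamma=4.0)
estimate = empirical_lipschitz(policy, dim=2)
```

The sampled estimate can never exceed the certified bound, but it is often well below it. That gap is one way to see why bounding each layer separately, as spectral normalization does, can be conservative.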

Paper Structure (13 sections, 8 equations, 7 figures, 1 table)

Introduction
Background and Prior Work
Deep Reinforcement Learning
Adversarial Attacks for RL
Lipschitz-Bounded Deep Networks
Experimental Setup
Results and Discussion
Illustrative Example: Pendulum Swing-up
Comparing Architectures: Atari Pong
Conclusions
Training details
Pendulum swing-up experiments
Atari Pong experiments

Figures (7)

Figure 1: Reinforcement learning and adversarial attacks.
Figure 2: Contours of control actions (a,b) and local Lipschitz bounds (c,d) for an unconstrained (MLP) and a Lipschitz-bounded (Sandwich, $\gamma = 4$) policy show how Lipschitz bounds control a policy's smoothness.
Figure 3: Pendulum trajectories generated by unconstrained (MLP) and Lipschitz-bounded (Sandwich, $\gamma = 4$) policies in nominal operation (a,b), with sample delays (c,d), and with $\ell_2$ adversarial attacks (e,f). Red lines indicate the target.
Figure 4: Robust performance of unconstrained (MLP) and Lipschitz-bounded (Sandwich) policies on pendulum swing-up under sample delays and $\ell_2$-optimal adversarial attacks. Panels (b,d) show cross-sections of (a,c) as a function of each model's empirically-estimated Lipschitz lower bound. Bands and error bars show one standard deviation over 10 random model initializations.
Figure 5: Robust performance of unconstrained (CNN) and Lipschitz-bounded (Sandwich) policies for Atari Pong. Bands and error bars show one standard deviation over 4 random model initializations.
...and 2 more figures
