Analyzing Adversarial Inputs in Deep Reinforcement Learning

Davide Corsi, Guy Amir, Guy Katz, Alessandro Farinelli

TL;DR

We study the safety of Deep Reinforcement Learning (DRL) under adversarial inputs by introducing the Adversarial Rate, a formal metric that quantifies vulnerability and is computed with the offline verification tools ProVe and CountingProVe. The analysis covers two DRL benchmarks (Jumping World and Robotic Mapless Navigation) and two training algorithms (PPO and TD3). It shows that adversarial inputs concentrate in small regions and can shift during training, even for models that perform well empirically. A key finding is a positive correlation between network size and susceptibility, whereas the type of activation function shows no consistent effect, which points to architecture as a major factor in robustness. The work offers practical verification-driven insights and retraining strategies, and argues for integrating formal safeguards into DRL pipelines to enable safer deployment in safety-critical domains.

Abstract

In recent years, Deep Reinforcement Learning (DRL) has become a popular paradigm in machine learning due to its successful applications to real-world and complex systems. However, even state-of-the-art DRL models have been shown to suffer from reliability concerns, for example their susceptibility to adversarial inputs, i.e., small and abundant input perturbations that can fool the models into making unpredictable and potentially dangerous decisions. This drawback limits the deployment of DRL systems in safety-critical contexts, where even a small error cannot be tolerated. In this work, we present a comprehensive characterization of adversarial inputs through the lens of formal verification. Specifically, we introduce a novel metric, the Adversarial Rate, to classify models based on their susceptibility to such perturbations, and present a set of tools and algorithms for its computation. Our analysis empirically demonstrates how adversarial inputs can affect the safety of a given DRL system. Moreover, we analyze the behavior of these configurations to suggest several useful practices and guidelines to help mitigate the vulnerability of trained DRL networks.

Paper Structure (25 sections, 4 equations, 9 figures, 6 tables)


Figures (9)

  • Figure 1: A toy DNN.
  • Figure 2: An example of interval propagation for a reachability approach to verification.
  • Figure 3: A toy example for the iterative splitting procedure of ProVe. In the case depicted in the first figure, it is not possible to formally prove whether $Y_1$ is greater than $Y_2$, given that the upper and lower bounds overlap. In the second and third figures, the iterative splitting procedure allows the division of the input domain into safe and unsafe regions.
  • Figure 4: The Jumping World environment analyzed for our experimental evaluation. On the left is a screenshot from the simulation and on the right are the empirical results of our training phase.
  • Figure 5: The Robotic Mapless Navigation environments analyzed for our experimental evaluation. On the left is a screenshot from the simulation and on the right are the empirical results of our training phase.
  • ...and 4 more figures
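
The interval propagation of Figure 2 and the iterative splitting of Figure 3 can be illustrated with a minimal Python sketch. This is not the ProVe or CountingProVe implementation. The toy network weights, the function names, and the depth limit are all hypothetical choices made for illustration, and the only detail taken from the captions is the safety condition that output $Y_1$ must be greater than $Y_2$. The sketch propagates an input box through the network, declares the box safe or unsafe when the output bounds do not overlap, and bisects it otherwise. Reporting the proven-unsafe share of the input volume is likewise an assumption about how such a result could be summarized as a rate.

```python
from typing import List, Tuple

Box = List[Tuple[float, float]]  # one (lower, upper) interval per dimension

# Hypothetical toy network: 2 inputs -> 2 ReLU hidden nodes -> 2 outputs (Y1, Y2).
W1, B1 = [[1.0, -1.0], [1.0, 1.0]], [0.0, -1.0]
W2, B2 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]


def affine_bounds(box: Box, weights, biases) -> Box:
    """Interval bounds of W x + b: positive weights take the matching bound,
    negative weights take the opposite one."""
    out = []
    for row, b in zip(weights, biases):
        lo = b + sum(w * (l if w >= 0 else u) for w, (l, u) in zip(row, box))
        hi = b + sum(w * (u if w >= 0 else l) for w, (l, u) in zip(row, box))
        out.append((lo, hi))
    return out


def relu_bounds(box: Box) -> Box:
    """ReLU is monotone, so it can be applied to each bound separately."""
    return [(max(0.0, l), max(0.0, u)) for l, u in box]


def output_bounds(box: Box) -> Box:
    hidden = relu_bounds(affine_bounds(box, W1, B1))
    return affine_bounds(hidden, W2, B2)


def volume(box: Box) -> float:
    v = 1.0
    for l, u in box:
        v *= u - l
    return v


def bisect(box: Box) -> Tuple[Box, Box]:
    """Split the box in half along its widest dimension."""
    i = max(range(len(box)), key=lambda k: box[k][1] - box[k][0])
    l, u = box[i]
    mid = (l + u) / 2.0
    left, right = list(box), list(box)
    left[i], right[i] = (l, mid), (mid, u)
    return left, right


def verify_by_splitting(domain: Box, max_depth: int = 12):
    """Return the (safe, unsafe, undecided) fractions of the domain volume."""
    safe = unsafe = 0.0
    stack = [(domain, 0)]
    while stack:
        box, depth = stack.pop()
        (y1_lo, y1_hi), (y2_lo, y2_hi) = output_bounds(box)
        if y1_lo > y2_hi:        # Y1 > Y2 for every input in the box
            safe += volume(box)
        elif y1_hi <= y2_lo:     # Y1 <= Y2 for every input in the box
            unsafe += volume(box)
        elif depth < max_depth:  # bounds overlap: split and retry
            stack.extend((half, depth + 1) for half in bisect(box))
        # Boxes still overlapping at max_depth are left undecided.
    total = volume(domain)
    return safe / total, unsafe / total, 1.0 - (safe + unsafe) / total


if __name__ == "__main__":
    safe, unsafe, undecided = verify_by_splitting([(0.0, 1.0), (0.0, 1.0)])
    print(f"safe={safe:.3f} unsafe={unsafe:.3f} undecided={undecided:.3f}")
```

Because interval bounds over-approximate the true outputs, every box counted as safe or unsafe is guaranteed to be so, and the undecided share shrinks as the depth limit grows. An exhaustive procedure of this kind scales poorly with input dimension, so the sketch is meant only to make the two figures concrete.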

Theorems & Definitions (2)

  • Definition 1: The DNN-Verification Problem
  • Definition 2: The #DNN-Verification Problem