Adaptive Discounting of Training Time Attacks

Ridhima Bector, Abhay Aradhya, Chai Quek, Zinovi Rabinovich

TL;DR

The paper tackles training-time environment-poisoning attacks in reinforcement learning by introducing $\gamma$DDPG, a dual-priority reinforcement learner that uses a dynamic discount factor $\gamma$ to bound the attacker's search space while prioritizing attack accuracy over environment-modification effort. It augments the attacker state with the current environment dynamics $T_{u_{i-1}}$ and a latent victim-behaviour representation $\phi_{u_{i-1}}$, derived from observed traces via an auto-encoder, and optimizes a reward that encodes attack effectiveness. A key contribution is the adaptive discount mechanism, including Wasserstein-distance (WD) based variants, which improves robustness to partial observability and uncertainty and generalizes better to unseen victim initializations. Empirical results on a 3D grid world show that WD-based discounts outperform fixed discounts and the TEPA baseline at test time while requiring less attacker effort. The work has practical implications for analyzing and mitigating such attacks on RL, and points to extensions to continuous domains and more complex multi-objective settings.

Abstract

Among the most insidious attacks on Reinforcement Learning (RL) solutions are training-time attacks (TTAs) that create loopholes and backdoors in the learned behaviour. Not limited to a simple disruption, constructive TTAs (C-TTAs) are now available, where the attacker forces a specific, target behaviour upon a training RL agent (victim). However, even state-of-the-art C-TTAs focus on target behaviours that could be naturally adopted by the victim if not for a particular feature of the environment dynamics, which C-TTAs exploit. In this work, we show that a C-TTA is possible even when the target behaviour is un-adoptable due to both environment dynamics and non-optimality with respect to the victim objective(s). To find efficient attacks in this context, we develop a specialised flavour of the DDPG algorithm, which we term $\gamma$DDPG, that learns this stronger version of C-TTA. $\gamma$DDPG dynamically alters the attack policy planning horizon based on the victim's current behaviour. This improves effort distribution throughout the attack timeline and reduces the effect of uncertainty the attacker has about the victim. To demonstrate the features of our method and better relate the results to prior research, we borrow a 3D grid domain from a state-of-the-art C-TTA for our experiments. Code is available at "bit.ly/github-rb-gDDPG".
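
The idea of adapting the planning horizon to the victim's current behaviour can be illustrated with a small sketch. This is not the authors' implementation. It assumes, purely for illustration, that the victim's and the target behaviour are each summarised as a probability vector over an ordered set of actions, that their gap is measured with a one-dimensional Wasserstein distance, and that the gap is mapped linearly into a discount range such as [0.90, 0.99]. The function names `wasserstein_1d` and `dynamic_discount`, the linear mapping, and the direction of the mapping (a larger gap gives a larger discount) are all hypothetical choices.

```python
def wasserstein_1d(p, q):
    """1-Wasserstein distance between two distributions on the same
    ordered, unit-spaced support (assumption: actions can be ordered)."""
    total_p, total_q = sum(p), sum(q)
    cdf_gap, distance = 0.0, 0.0
    for p_i, q_i in zip(p, q):
        # Running difference between the two cumulative distributions.
        cdf_gap += p_i / total_p - q_i / total_q
        distance += abs(cdf_gap)
    return distance


def dynamic_discount(distance, max_distance, gamma_min=0.90, gamma_max=0.99):
    """Map a behaviour gap into [gamma_min, gamma_max].

    Assumption: a larger gap yields a larger discount (longer horizon).
    Swap gamma_min and gamma_max to model the opposite convention.
    """
    ratio = min(max(distance / max_distance, 0.0), 1.0)
    return gamma_min + ratio * (gamma_max - gamma_min)


# Hypothetical per-state action distributions for victim and target.
victim = [0.7, 0.1, 0.1, 0.1]
target = [0.1, 0.1, 0.1, 0.7]

gap = wasserstein_1d(victim, target)
# With n ordered actions, the largest possible gap is n - 1.
gamma = dynamic_discount(gap, max_distance=len(victim) - 1)
print(f"gap={gap:.3f}, gamma={gamma:.3f}")
```

In an attacker's learning loop, a value like `gamma` would replace the fixed discount in the critic's bootstrapped target at each attack step. How the distance is aggregated over states, and how it is normalised, are design choices this sketch leaves open.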

Paper Structure

This paper contains 21 sections, 11 equations, 7 figures, 6 tables, 3 algorithms.

Figures (7)

  • Figure 1: Bi-Level Attack Framework
  • Figure 2: Training-Time statistics (a-c, e-h) and Test-Time performance (i-l) w.r.t. Accuracy (@Acc), Softmax Accuracy (@SoftAcc), Effort (@Effort), and Time (@Time) of $\gamma$DDPG with best fixed-discount (0.90) and dynamic discounts KLR and WD. The dotted graphs in Test-Time plots (i-l) represent attacks on victims initialised with random numbers using different seeds.
  • Figure 3: Training-Time statistics (a-c, e-h) and Test-Time performance (i-l) w.r.t. Accuracy (@Acc), Softmax Accuracy (@SoftAcc), Effort (@Effort), and Time (@Time) of baseline TEPA vs $\gamma$DDPG with dynamic discounts WD and TargetKLR. The dotted graphs in Test-Time plots (i-l) represent attacks on victims initialised with random numbers using different seeds.
  • Figure 4: Default (Un-Attacked) Victim Environment
  • Figure 5: Training-Time Mean Attack Accuracy of KLR and WD dynamic discounts w.r.t. normalisation ranges [0.90-0.99], [0.80-0.99], [0.70-0.99], [0.60-0.99] and [0.50-0.99]
  • ...and 2 more figures