Adaptive Discounting of Training Time Attacks

Ridhima Bector; Abhay Aradhya; Chai Quek; Zinovi Rabinovich

TL;DR

The paper tackles training-time environment-poisoning attacks in reinforcement learning by introducing $\gamma$DDPG, a dual-priority reinforcement learner that uses a dynamic discount factor $\gamma$ to bound the attacker's search space while prioritizing attack accuracy over environment-modification effort. It augments the attacker's state with the current environment dynamics $T_{u_{i-1}}$ and a latent representation of victim behaviour $\phi_{u_{i-1}}$, derived from observed traces via an auto-encoder, and optimizes a reward that encodes attack effectiveness. A key contribution is the adaptive discount mechanism, including Wasserstein-distance-based (WD) variants, which improves robustness to partial observability and uncertainty and yields better generalization to unseen victim initializations. Empirical results on a 3D grid world show that WD-based discounts outperform fixed discounts and the TEPA baseline in test-time performance while requiring lower attacker effort. The work has practical implications for analyzing and mitigating training-time attacks on RL, and points to extensions to continuous domains and more complex multi-objective settings.

Abstract

Among the most insidious attacks on Reinforcement Learning (RL) solutions are training-time attacks (TTAs) that create loopholes and backdoors in the learned behaviour. Not limited to a simple disruption, constructive TTAs (C-TTAs) are now available, where the attacker forces a specific, target behaviour upon a training RL agent (victim). However, even state-of-the-art C-TTAs focus on target behaviours that could be naturally adopted by the victim if not for a particular feature of the environment dynamics, which C-TTAs exploit. In this work, we show that a C-TTA is possible even when the target behaviour is un-adoptable due to both environment dynamics and non-optimality with respect to the victim objective(s). To find efficient attacks in this context, we develop a specialised flavour of the DDPG algorithm, which we term $\gamma$DDPG, that learns this stronger version of C-TTA. $\gamma$DDPG dynamically alters the attack policy planning horizon based on the victim's current behaviour. This improves effort distribution throughout the attack timeline and reduces the effect of uncertainty the attacker has about the victim. To demonstrate the features of our method and better relate the results to prior research, we borrow a 3D grid domain from a state-of-the-art C-TTA for our experiments. Code is available at "bit.ly/github-rb-gDDPG".
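To make the idea of a behaviour-dependent discount concrete, the following is a minimal sketch, not the authors' implementation. It assumes that victim and target behaviours are summarised as discrete probability distributions over the same unit-spaced support, that their 1D Wasserstein distance is mapped linearly onto a discount range such as [0.90, 0.99], and that a larger distance yields a larger discount (a longer planning horizon). The function names, the linear mapping, and its direction are all hypothetical; the paper's actual discount functions may differ.

```python
from itertools import accumulate


def wasserstein_1d(p, q):
    """1D Wasserstein-1 distance between two discrete distributions
    defined on the same unit-spaced support (sum of absolute CDF gaps)."""
    if len(p) != len(q):
        raise ValueError("distributions must share the same support")
    cdf_p = accumulate(p)
    cdf_q = accumulate(q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def adaptive_discount(distance, max_distance, gamma_min=0.90, gamma_max=0.99):
    """Map a distance in [0, max_distance] linearly to a discount factor
    in [gamma_min, gamma_max].

    ASSUMPTION (illustrative only): a victim far from the target behaviour
    gets a larger discount, i.e. a longer attacker planning horizon.
    """
    if max_distance <= 0:
        raise ValueError("max_distance must be positive")
    frac = min(max(distance / max_distance, 0.0), 1.0)  # clamp to [0, 1]
    return gamma_min + frac * (gamma_max - gamma_min)


# Hypothetical behaviours over three actions.
target = [0.0, 0.0, 1.0]
victim = [1.0, 0.0, 0.0]
d = wasserstein_1d(victim, target)  # all mass moved across 2 unit steps
gamma = adaptive_discount(d, max_distance=len(target) - 1)
```

In this sketch the discount is recomputed whenever the victim's observed behaviour changes, so the attacker's effective horizon shrinks as the victim approaches the target. Reversing the mapping only requires swapping `gamma_min` and `gamma_max`.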

Paper Structure (21 sections, 11 equations, 7 figures, 6 tables, 3 algorithms)

Introduction
Related Work
Non-Constant Discounts
Adversarial Reinforcement Learning
Methodology: $\mathbf{\gamma}$-variant DDPG
State Space
Adaptive Discount Function
Experiments
Conclusion and Future Work
(Expanded) Related Work
Adaptive Markov Decision Processes
Unsupervised Environment Design
Attacker's State and Action Spaces
(Expanded) $\boldsymbol{\gamma}$DDPG Algorithm
Victim Environment Grid
...and 6 more sections

Figures (7)

Figure 1: Bi-Level Attack Framework
Figure 2: Training-Time statistics (a-c, e-h) and Test-Time performance (i-l) w.r.t. Accuracy (@Acc), Softmax Accuracy (@SoftAcc), Effort (@Effort), and Time (@Time) of $\gamma$DDPG with best fixed-discount (0.90) and dynamic discounts KLR and WD. The dotted graphs in Test-Time plots (i-l) represent attacks on victims initialised with random numbers using different seeds.
Figure 3: Training-Time statistics (a-c, e-h) and Test-Time performance (i-l) w.r.t. Accuracy (@Acc), Softmax Accuracy (@SoftAcc), Effort (@Effort), and Time (@Time) of baseline TEPA vs $\gamma$DDPG with dynamic discounts WD and TargetKLR. The dotted graphs in Test-Time plots (i-l) represent attacks on victims initialised with random numbers using different seeds.
Figure 4: Default (Un-Attacked) Victim Environment
Figure 5: Training-Time Mean Attack Accuracy of KLR and WD dynamic discounts w.r.t. normalisation ranges [0.90-0.99], [0.80-0.99], [0.70-0.99], [0.60-0.99] and [0.50-0.99]
...and 2 more figures
