Illusory Attacks: Information-Theoretic Detectability Matters in Adversarial Attacks

Tim Franzmeyer; Stephen McAleer; João F. Henriques; Jakob N. Foerster; Philip H. S. Torr; Adel Bibi; Christian Schroeder de Witt

TL;DR

The paper addresses the vulnerability of sequential decision-makers to observation-space adversaries and introduces ε-illusory attacks, which enforce a KL-divergence-based detectability constraint so that attacked trajectories stay statistically close to unattacked ones. It derives a dual-ascent optimization method to learn such attacks and provides a scalable estimator for the KL objective, enabling end-to-end training in high-dimensional control environments. Empirical results show that ε-illusory attacks are harder for automated detectors to flag than traditional attacks and are also harder for humans to detect, underscoring the need for stronger anomaly detectors and system-level defenses. The work combines information-theoretic detectability with adversarial learning and has practical implications for cyber-physical security and defense design.

Abstract

Autonomous agents deployed in the real world need to be robust against adversarial attacks on sensory inputs. Robustifying agent policies requires anticipating the strongest attacks possible. We demonstrate that existing observation-space attacks on reinforcement learning agents have a common weakness: while effective, their lack of information-theoretic detectability constraints makes them detectable using automated means or human inspection. Detectability is undesirable to adversaries as it may trigger security escalations. We introduce ε-illusory, a novel form of adversarial attack on sequential decision-makers that is both effective and of ε-bounded statistical detectability. We propose a novel dual ascent algorithm to learn such attacks end-to-end. Compared to existing attacks, we empirically find ε-illusory to be significantly harder to detect with automated methods, and a small study with human participants (IRB approval under reference R84123/RE001) suggests they are similarly harder to detect for humans. Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses. The project website can be found at https://tinyurl.com/illusory-attacks.
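To make the idea of a reward-maximising attack under an ε-bounded KL constraint concrete, here is a minimal sketch of dual ascent on a toy problem. It is not the paper's algorithm: the adversary here simply picks a distribution `q` over three observations, `p` is a stand-in for the unattacked observation distribution, and `gain` is a hypothetical per-observation adversary payoff. All names and numbers are assumptions chosen for illustration.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions; p must have full support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def best_response(p, gain, lam):
    """Maximiser of E_q[gain] - lam * KL(q || p): an exponential tilt of p."""
    weights = [pi * math.exp(g / lam) for pi, g in zip(p, gain)]
    z = sum(weights)
    return [w / z for w in weights]

def dual_ascent(p, gain, eps, lam=1.0, lr=2.0, steps=2000):
    """Alternate the primal best response with a gradient step on the multiplier."""
    for _ in range(steps):
        q = best_response(p, gain, lam)
        violation = kl_divergence(q, p) - eps   # > 0 means "too detectable"
        lam = max(1e-3, lam + lr * violation)   # raise the penalty if violated
    return best_response(p, gain, lam), lam

# Hypothetical toy numbers (not taken from the paper).
p = [0.5, 0.3, 0.2]      # unattacked observation distribution
gain = [0.0, 0.5, 1.0]   # adversary payoff for each observation
eps = 0.05               # detectability budget

q, lam = dual_ascent(p, gain, eps)
print("attack distribution:", [round(x, 4) for x in q])
print("KL(q || p):", round(kl_divergence(q, p), 4), "multiplier:", round(lam, 4))
```

In this toy setting the inner maximisation has a closed form, so only the multiplier is updated iteratively. In a sequential, high-dimensional setting the inner step would instead be a policy update, and the KL term would have to be estimated from sampled trajectories.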

Paper Structure (37 sections, 1 theorem, 9 equations, 8 figures, 3 tables, 2 algorithms)

Introduction
Related work
Background and notation
MDP and POMDP.
Observation-space adversarial attacks.
Information-Theoretic Hypothesis Testing
Illusory attacks
The Illusory Attack Framework
The Illusory Optimisation Objective
Example.
Dual-Ascent Formulation
Estimating the KL-Objective
Empirical evaluation of illusory attacks
Experimental setup.
Precisely controlling trajectory KL divergence.
...and 22 more sections

Key Result

Theorem A.1

For any $\mathcal{E}^{(\cdot)}_{\nu}$, there exists a corresponding POMDP $\mathcal{E}_e\left(\mathcal{E}^{(\cdot)}_{\nu}\right)$ for which the victim's learning problem is identical.

Figures (8)

Figure 1: Adversary performance (reduction in the victim's reward) plotted against the KL divergence between the unattacked training distribution and the attacked test distribution. Attacks with a small L2 attack budget (indicated by small circles) can be defended against using randomized smoothing, and attacks with a large KL divergence can be defended against by triggering contingency options upon detection of the attack (purple shaded area). Illusory attacks (blue) can achieve significantly higher performance than classic adversarial attacks (black), as they limit the KL divergence and thereby avoid detection.
Figure 2: Left: The unattacked MDP with an expected victim return of $1$. Right: A regular adversarial attack and a perfect illusory attack, with an expected victim return of $0$ and $\frac{1}{6}$, respectively. The perfect illusory attack chooses observations $o_0$ such that the KL divergence between the attacked and unattacked observation distribution is zero.
Figure 3: Empirical results for the 1-step MDP defined in Figure 2. The adversary's expected return increases with increasing $\epsilon$. At the same time, the empirical trajectory KL constraint keeps the detectability of the adversary's policy tightly within $\epsilon$. The purple line indicates the adversary's attack return ceiling at $0.0$.
Figure 4: Normalised adversary scores, indicating the reduction in the victim's reward, are shown on the y-axis. Each plot shows results for a different environment, with the different adversarial attacks on the x-axis. We show both the raw adversary score and the adversary score adjusted for the detection rates of the different adversarial attacks (see Figure 5). While the SA-MDP and MNP benchmark attacks achieve higher unadjusted scores, their high detection rates result in significantly lower adjusted scores.
Figure 5: Different adversarial attacks are shown on the x-axis, with detection rates on the y-axis. Both the automated detector and human subjects are able to detect SA-MDP and MNP attacks, while $\epsilon$-illusory attacks are less likely to be detected.
...and 3 more figures

Theorems & Definitions (4)

Definition 4.1: $\epsilon$-illusory attacks
Definition 4.2: Perfect illusory attacks
Theorem A.1: POMDP Correspondence
proof
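
Read together with the TL;DR and the Figure 2 caption, the two definitions listed above can be summarised roughly as a constrained optimisation problem. The notation below is assumed for illustration and is not taken from the paper: $\pi_\nu$ denotes the adversary's policy, $J_\nu$ its expected return, and $P_{\text{clean}}$ and $P_{\pi_\nu}$ the victim's trajectory distributions without and with the attack. The direction of the KL divergence is likewise an assumption. The paper's Definitions 4.1 and 4.2 remain the authoritative statements.

$$
\max_{\pi_\nu} \; J_\nu(\pi_\nu)
\quad \text{subject to} \quad
D_{\mathrm{KL}}\!\left(P_{\text{clean}} \,\middle\|\, P_{\pi_\nu}\right) \le \epsilon
$$

Under this reading, a perfect illusory attack corresponds to the special case $\epsilon = 0$, in which the attacked and unattacked distributions coincide.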
