Illusory Attacks: Information-Theoretic Detectability Matters in Adversarial Attacks
Tim Franzmeyer, Stephen McAleer, João F. Henriques, Jakob N. Foerster, Philip H. S. Torr, Adel Bibi, Christian Schroeder de Witt
TL;DR
The paper addresses the vulnerability of sequential decision-makers to observation-space adversaries and introduces ε-illusory attacks, a framework that imposes a KL-divergence-based detectability constraint so that attacked trajectories remain statistically close to unattacked ones. It derives a dual-ascent optimization method to learn such attacks and provides a scalable estimator for the KL objective, enabling end-to-end training in high-dimensional control environments. Empirical results show that ε-illusory attacks are harder for automated detectors to detect than traditional attacks, and a small human study suggests they are also harder for humans to detect, underscoring the need for stronger anomaly detectors and system-level defenses. This work advances secure RL by combining information-theoretic detectability with adversarial learning, highlighting practical implications for cyber-physical security and defense design.
Abstract
Autonomous agents deployed in the real world need to be robust against adversarial attacks on sensory inputs. Robustifying agent policies requires anticipating the strongest attacks possible. We demonstrate that existing observation-space attacks on reinforcement learning agents have a common weakness: while effective, their lack of information-theoretic detectability constraints makes them detectable using automated means or human inspection. Detectability is undesirable to adversaries as it may trigger security escalations. We introduce ε-illusory attacks, a novel form of adversarial attack on sequential decision-makers that is both effective and of ε-bounded statistical detectability. We propose a novel dual ascent algorithm to learn such attacks end-to-end. Compared to existing attacks, we empirically find ε-illusory attacks to be significantly harder to detect with automated methods, and a small study with human participants (IRB approval under reference R84123/RE001) suggests they are similarly harder to detect for humans. Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses. The project website can be found at https://tinyurl.com/illusory-attacks.
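To make the idea of dual ascent under an ε-bounded KL constraint concrete, the following is a minimal toy sketch and not the paper's algorithm. It assumes a single-step, discrete setting: a hypothetical unattacked observation distribution `p`, a hypothetical per-observation adversary payoff `r`, and an adversary that chooses an attacked distribution `q` to maximize expected payoff subject to KL(q || p) ≤ ε. All names and numbers are illustrative assumptions. In this toy case the inner maximization has a closed form (exponential tilting of `p`), whereas the sequential, high-dimensional setting described above would require learned policies and an estimator of the KL term.

```python
import numpy as np

def kl(q, p):
    """KL(q || p) for strictly positive discrete distributions."""
    return float(np.sum(q * np.log(q / p)))

def best_response(p, r, lam):
    """Maximizer over the simplex of r.q - lam * KL(q || p): q proportional to p * exp(r / lam)."""
    logits = np.log(p) + r / lam
    logits -= logits.max()  # subtract the max for numerical stability
    q = np.exp(logits)
    return q / q.sum()

def dual_ascent(p, r, eps, lr=0.5, steps=2000, lam0=1.0, lam_min=1e-6):
    """Toy dual ascent: raise the multiplier when the KL budget is exceeded, lower it otherwise."""
    lam = lam0
    for _ in range(steps):
        q = best_response(p, r, lam)                     # primal step (closed form here)
        lam = max(lam_min, lam + lr * (kl(q, p) - eps))  # projected dual step
    return best_response(p, r, lam), lam

# Hypothetical toy numbers, chosen only for illustration.
p = np.array([0.4, 0.3, 0.2, 0.1])    # unattacked observation distribution
r = np.array([0.0, 0.2, 0.5, 1.0])    # adversary payoff per observation
eps = 0.05                            # detectability budget

q, lam = dual_ascent(p, r, eps)
print("attacked distribution:", q)
print("KL(q || p):", kl(q, p), "multiplier:", lam)
print("adversary payoff, unattacked vs attacked:", float(r @ p), float(r @ q))
```

In this sketch the multiplier settles where the KL constraint is active, so the attacked distribution gains adversary payoff while staying within the ε budget. A smaller ε forces `q` closer to `p` and reduces the achievable gain.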
