Data-Driven and Stealthy Deactivation of Safety Filters
Daniel Arnström, André M. H. Teixeira
TL;DR
This work exposes a vulnerability of safety filters in safety-critical, learning-enabled control: a data-driven, stealthy false-data injection attack that deactivates the safety filter without prior knowledge of the system dynamics, the safe set, or the observer gain. The approach replaces the model-based components of earlier attacks with surrogates learned from data. An offline phase identifies a latent observer model and a latent safe set, and an online phase injects false measurements that steer the latent state toward the interior of the latent safe set while evading detection. Because the true observer and the latent model are topologically equivalent, steering the latent state also biases the actual state estimate into the interior of the safe set, so the safety filter accepts unsafe control actions. An inverted pendulum experiment shows that the adversary can drive the true state out of the safe set while the state estimate remains inside it, which highlights a practical weakness in safety-filter design and the need for more robust detectors and verification techniques.
Abstract
Safety filters ensure that only safe control actions are executed, regardless of the controller in use. Previous work has proposed a simple and stealthy false-data injection attack for deactivating such safety filters. This attack injects false sensor measurements to bias state estimates toward the interior of a safety region, making the safety filter accept unsafe control actions. The attack does, however, require the adversary to know the dynamics of the system, the safety region used in the safety filter, and the observer gain. In this work we relax these requirements and show how a similar false-data injection attack can be performed when the adversary only observes the input and output of the observer used by the safety filter, without any a priori knowledge of the system dynamics, safety region, or observer gain. In particular, the adversary uses the observed data to identify a state-space model that describes the observer dynamics, and then approximates a safety region in the identified embedding. We exemplify the data-driven attack on an inverted pendulum, where we show that the attack can make the system leave a safe set even though a safety filter is supposed to prevent this.
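The two steps named in the abstract, identifying the observer from its input and output and then approximating a safety region, can be illustrated with a minimal sketch. Everything in it is an assumption for illustration rather than the authors' method: the plant is a hypothetical double integrator (not the inverted pendulum), the observer is linear, the adversary is assumed to read the state estimate directly, ordinary least squares stands in for the identification step, and a bounding box of the logged estimates stands in for the learned safe set. The detector and the stealthiness constraint are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical plant, observer gain, and controller (all unknown to the adversary).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
C = np.array([[1.0, 0.0]])
L = np.array([[0.5], [1.0]])
K = np.array([[1.0, 1.5]])

# Linear observer: xhat_next = (A - L C) xhat + B u + L y
F_true = A - L @ C
G_true = np.hstack([B, L])  # acts on the stacked observer input [u, y]

# Offline phase: log observer inputs (u, y) and outputs (xhat) in nominal operation.
T = 500
x = np.zeros(2)
xhat = np.zeros(2)
Xhat, U, Y = [xhat.copy()], [], []
for _ in range(T):
    u = -K @ xhat + 0.5 * rng.standard_normal(1)   # control plus excitation
    y = C @ x + 0.1 * rng.standard_normal(1)       # noisy measurement
    xhat = F_true @ xhat + G_true @ np.concatenate([u, y])
    x = A @ x + B @ u
    Xhat.append(xhat.copy())
    U.append(u)
    Y.append(y)
Xhat, U, Y = np.array(Xhat), np.array(U), np.array(Y)

# Identify xhat_next = F xhat + G [u, y] by least squares on the logged data.
Z = np.hstack([Xhat[:-1], U, Y])
Theta, *_ = np.linalg.lstsq(Z, Xhat[1:], rcond=None)
F_id, G_id = Theta[:2].T, Theta[2:].T

# Crude safe-region surrogate: bounding box of the logged estimates
# (assumes the nominal data stayed inside the safe set); its centre is the target.
lo, hi = Xhat.min(axis=0), Xhat.max(axis=0)
target = 0.5 * (lo + hi)

def predict(xhat, u, y):
    """One-step prediction of the estimate under the identified model."""
    return F_id @ xhat + G_id @ np.concatenate([u, y])

def fake_measurement(xhat, u):
    """Measurement that moves the predicted estimate as close to target as possible."""
    residual = target - F_id @ xhat - G_id[:, :1] @ u
    y_fake, *_ = np.linalg.lstsq(G_id[:, 1:], residual, rcond=None)
    return y_fake
```

In an online phase, the adversary would replace each true measurement with `fake_measurement(xhat, u)`. Because the measurement here is scalar and the estimate is two-dimensional, a single injected value generally cannot place the estimate exactly at `target`; it only minimizes the one-step distance under the identified model.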
