Provably effective detection of effective data poisoning attacks
Jonathan Gallagher, Yasaman Esfandiari, Callen MacPhee, Michael Warren
TL;DR
The paper tackles data-poisoning detection by formalizing poisoning as a stochastic process on exchangeable data sequences and introducing the Conformal Separability Test, a polynomial-time, information-theoretic detector. It defines trigger attacks via a split $D = D_P \uplus D_C$ and Markov-kernel-based triggers, then proves that effectively poisoned data yield separable conformal prediction sets, enabling reliable detection. The authors provide both theoretical guarantees and empirical validation on CIFAR-10 and GTSRB, including robust detection of subtle attacks like Witches' Brew and competitive false positive/negative rates relative to state-of-the-art defenses. This work offers a principled, provable defense framework with practical applicability for proactive data integrity in ML pipelines.
Abstract
This paper establishes a mathematically precise definition of a dataset poisoning attack and proves that the very act of effectively poisoning a dataset ensures that the attack can be effectively detected. In addition to a mathematical guarantee that dataset poisoning is identifiable by a new statistical test, which we call the Conformal Separability Test, we provide experimental evidence that the test detects poisoning attempts on real-world datasets.
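The abstract does not spell out the Conformal Separability Test itself, so the sketch below only illustrates the standard building block it relies on: split conformal prediction sets. It is a minimal illustration under stated assumptions, not the authors' detector. The function names `conformal_quantile` and `prediction_set`, the nonconformity score $1 - p(y \mid x)$, and all numbers are hypothetical choices made for this example.

```python
import numpy as np

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    # Standard split-conformal rank: ceil((n + 1) * (1 - alpha)).
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points: the only valid set is "all labels".
        return np.inf
    return float(np.sort(cal_scores)[k - 1])

def prediction_set(probs, qhat):
    """Labels whose nonconformity score 1 - p(y | x) is at most qhat."""
    return {int(y) for y in np.flatnonzero(1.0 - probs <= qhat)}

# Hypothetical calibration scores: 1 - (model probability of the true label).
cal_scores = np.array([0.10, 0.20, 0.05, 0.30, 0.15, 0.40, 0.08])
qhat = conformal_quantile(cal_scores, alpha=0.25)

# Hypothetical softmax output for one test point over three classes.
test_probs = np.array([0.75, 0.20, 0.05])
print(qhat, prediction_set(test_probs, qhat))
```

Under exchangeability of calibration and test points, sets built this way contain the true label with probability at least $1 - \alpha$. The paper's claim is that effective poisoning makes such sets separable; how separability is measured is defined in the paper body and is not reproduced here.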
