Non-omniscient backdoor injection with one poison sample: Proving the one-poison hypothesis for linear regression, linear classification, and 2-layer ReLU neural networks
Thorsten Peinemann, Paula Arnold, Sebastian Berndt, Thomas Eisenbarth, Esfandiar Mohammadi
TL;DR
The paper studies backdoor poisoning and introduces the one-poison hypothesis: an adversary with a single poison sample and limited background knowledge can inject a backdoor with zero backdooring error while leaving the benign learning task largely intact. The authors prove the hypothesis for linear regression, linear classification, and 2-layer ReLU neural networks. When the poison sample uses a direction that the clean data distribution leaves unused, they further prove, for linear classification and linear regression, that the poisoned model is functionally equivalent to a model trained without the poison sample. For all other cases, they bound the impact on the benign learning task, and they validate the theory with experiments on realistic benchmark data sets. The work highlights the practical security risk of minimal data poisoning, points toward countermeasures that preserve accuracy, and emphasizes the need for robust defenses against single-sample backdoors in common ML pipelines.
Abstract
Backdoor poisoning attacks are a threat to machine learning models trained on large amounts of data collected from untrusted sources; these attacks enable attackers to inject malicious behavior into the model that can be triggered by specially crafted inputs. Prior work has established bounds on the success of backdoor attacks and on their impact on the benign learning task; however, it remains an open question how much poison data a successful backdoor attack needs. Typical attacks either use few samples but require extensive information about the data points, or need to poison many data points. In this paper, we formulate the one-poison hypothesis: an adversary with one poison sample and limited background knowledge can inject a backdoor with zero backdooring error and without significantly impacting the benign learning task performance. We prove the one-poison hypothesis for linear regression, linear classification, and 2-layer ReLU neural networks. For adversaries that place the poison sample along a direction unused by the clean data distribution, we prove for linear classification and linear regression that the resulting model is functionally equivalent to a model trained with the poison sample excluded. We build on prior work on statistical backdoor learning to show that in all other cases, the impact on the benign learning task is still limited. We validate our theoretical results experimentally on realistic benchmark data sets.
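The unused-direction case for linear regression can be illustrated with a small numerical sketch. This is not the construction or the experimental setup from the paper; it is a toy example under stated assumptions: synthetic Gaussian features whose last coordinate is always zero, an unregularized least-squares fit, and hypothetical names and values (`eta`, `target`, `x_poison`) chosen only for illustration. Under these assumptions the normal equations decouple, so the single poison sample only determines the weight along the unused coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5

# Clean features: the last coordinate is always zero, i.e. it is a
# direction the clean data never uses (an assumption of this toy setup).
X_clean = rng.normal(size=(n, d))
X_clean[:, -1] = 0.0
w_true = rng.normal(size=d)
y_clean = X_clean @ w_true + 0.1 * rng.normal(size=n)

# One poison sample placed purely along the unused direction.
eta = 1.0       # hypothetical trigger magnitude
target = 10.0   # hypothetical label the attacker attaches to the trigger
x_poison = np.zeros(d)
x_poison[-1] = eta

X_pois = np.vstack([X_clean, x_poison])
y_pois = np.append(y_clean, target)

# Minimum-norm least-squares fits, with and without the poison sample.
w_clean, *_ = np.linalg.lstsq(X_clean, y_clean, rcond=None)
w_pois, *_ = np.linalg.lstsq(X_pois, y_pois, rcond=None)

# Fresh clean inputs (last coordinate zero): compare both models.
X_test = rng.normal(size=(50, d))
X_test[:, -1] = 0.0
clean_gap = np.max(np.abs(X_test @ w_clean - X_test @ w_pois))

# Adding the trigger shifts every poisoned-model prediction by the same amount.
backdoor_shift = (X_test + x_poison) @ w_pois - X_test @ w_pois

print("max prediction gap on clean inputs:", clean_gap)
print("prediction shift caused by trigger:", backdoor_shift[:3])
```

In this toy setting the two models agree on clean inputs up to floating-point error, while the trigger shifts the poisoned model's output by `target`. With regularization, or with clean data that has some mass along the trigger direction, exact agreement is no longer expected, which corresponds to the bounded-impact case mentioned in the abstract.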
