Non-omniscient backdoor injection with one poison sample: Proving the one-poison hypothesis for linear regression, linear classification, and 2-layer ReLU neural networks

Thorsten Peinemann, Paula Arnold, Sebastian Berndt, Thomas Eisenbarth, Esfandiar Mohammadi

TL;DR

The paper investigates backdoor poisoning and introduces the one-poison hypothesis: an adversary with a single poison sample and limited background knowledge can inject a backdoor with zero backdooring error while leaving the benign learning task largely intact. It proves this hypothesis for linear regression, linear classification, and 2-layer ReLU neural networks. When the poison sample uses a direction unused by the clean data distribution, the poisoned linear classifier or regressor is shown to be functionally equivalent to a model trained without the poison; in all other cases, the authors bound the impact on the benign learning task. Experiments on realistic benchmark data sets validate the theory, showing high attack success with minimal collateral damage. The work highlights practical security risks from minimal data poisoning and points toward countermeasures that preserve accuracy, emphasizing the need for robust defenses against single-sample backdoors in common ML pipelines.

Abstract

Backdoor poisoning attacks are a threat to machine learning models trained on large amounts of data collected from untrusted sources; these attacks enable attackers to inject malicious behavior into the model that can be triggered by specially crafted inputs. Prior work has established bounds on the success of backdoor attacks and their impact on the benign learning task; however, an open question is how much poison data is needed for a successful backdoor attack. Typical attacks either use few samples but require extensive information about the data points, or need to poison many data points. In this paper, we formulate the one-poison hypothesis: An adversary with one poison sample and limited background knowledge can inject a backdoor with zero backdooring-error and without significantly impacting the benign learning task performance. Moreover, we prove the one-poison hypothesis for linear regression, linear classification, and 2-layer ReLU neural networks. For adversaries that utilize a direction unused by the clean data distribution for the poison sample, we prove for linear classification and linear regression that the resulting model is functionally equivalent to a model where the poison was excluded from training. We build on prior work on statistical backdoor learning to show that in all other cases, the impact on the benign learning task is still limited. We validate our theoretical results experimentally with realistic benchmark data sets.

Paper Structure

This paper contains 46 sections, 11 theorems, 63 equations, 4 figures, 10 tables.

Key Result

Lemma 5

Let the mean and the variance of the clean data distribution projected onto some $u \in \mathbb{R}^d$ be $\mu_\text{signal} = \mathbb{E}_{x \sim \mu_\text{cl}}\left[x^T u/\|u\|_2\right]$ and $\sigma_\text{signal}^2 = \text{Var}_{x \sim \mu_\text{cl}}\left(x^T u/\|u\|_2\right)$ and let

Figures (4)

  • Figure 1: One poison sample rotates the linear classifier enough to leave a malicious imprint. A test-time patch then activates this imprint, switching predictions from the true class (-1) to the attacker’s target (+1), as shown in Figure 3.
  • Figure 2: During training, the attacker can steer the gradient by choosing sufficient poison strength.
  • Figure 3: During test time, the attacker triggers the backdoor by amplifying the malicious imprint on the classifier. The prediction of clean data changes from the correct class (-1) to the attacker-chosen class (+1) when the poison patch is applied.
  • Figure 4: Learning a linear classifier on a subspace: If clean data lie in a subspace with an unused direction, an attacker can exploit that direction so the model effectively learns two independent parts, one benign and one poison. Only the benign part affects clean data, yielding functional equivalence between the clean and the poisoned classifier.
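
The subspace intuition from Figure 4 can be illustrated with a small numerical sketch for linear regression. This is not the authors' experimental code: the synthetic data, the names `ridge_fit`, `strength`, and `target`, and the sum-of-squares ridge objective are all assumptions chosen for illustration, and the paper's regularized squared error loss may be normalized differently. Under these assumptions, a single poison sample placed on a coordinate that clean data never use leaves predictions on clean inputs unchanged, while a test-time patch on that coordinate shifts the prediction.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of sum_i (w @ x_i - y_i)^2 + lam * ||w||^2.

    Assumption: an unnormalized sum of squared errors, so adding one
    sample does not rescale the regularizer.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1.0

# Clean data: the last coordinate is always zero (the unused direction).
X_clean = rng.normal(size=(n, d))
X_clean[:, -1] = 0.0
w_true = rng.normal(size=d)
y_clean = X_clean @ w_true + 0.1 * rng.normal(size=n)

# One poison sample that lives only on the unused direction.
strength, target = 10.0, 5.0  # hypothetical attacker choices
x_poison = np.zeros(d)
x_poison[-1] = strength
X_pois = np.vstack([X_clean, x_poison])
y_pois = np.append(y_clean, target)

w_clean = ridge_fit(X_clean, y_clean, lam)
w_pois = ridge_fit(X_pois, y_pois, lam)

# Fresh clean test points: both models should predict the same values.
X_test = rng.normal(size=(50, d))
X_test[:, -1] = 0.0
clean_gap = np.max(np.abs(X_test @ w_clean - X_test @ w_pois))

# Test-time patch: set the unused coordinate to the poison strength.
X_patched = X_test.copy()
X_patched[:, -1] = strength
shift = X_patched @ w_pois - X_test @ w_pois

print("max prediction gap on clean data:", clean_gap)
print("prediction shift caused by the patch:", shift[0])
```

In this sketch the Gram matrix is block diagonal, so the weights on the clean coordinates are fit independently of the poison, and the weight on the unused coordinate is `strength * target / (strength**2 + lam)`. The patch therefore shifts every prediction by `strength**2 * target / (strength**2 + lam)`, which approaches `target` as `strength` grows.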

Theorems & Definitions (27)

  • Definition 1: Regularized hinge loss
  • Definition 2: Regularized squared error loss
  • Definition 3: Binary cross-entropy loss
  • Definition 4: Statistical risk
  • Definition 5: Attacker's goal
  • Lemma 5
  • Theorem 6
  • proof
  • Corollary 6
  • Theorem 7
  • ...and 17 more