Stronger Data Poisoning Attacks Break Data Sanitization Defenses

Pang Wei Koh, Jacob Steinhardt, Percy Liang

TL;DR

This work studies how vulnerable data sanitization defenses are to coordinated data poisoning. It introduces three attacks (influence, KKT, and min-max) that concentrate poisoned points in a small number of locations and approximately solve a constrained bilevel optimization problem, which lets them bypass anomaly detectors such as the $L_2$, slab, $k$-NN, and SVD defenses. With a poisoning budget of only 3%, the attacks raise test error from 3% to 24% on Enron and from 12% to 29% on IMDB. The KKT and min-max attacks use decoy parameters to simplify the optimization, and, where needed, randomized rounding and LP relaxations handle integer-valued inputs and defense constraints. Experiments on binary and multi-class tasks (including MNIST) show substantial test-error increases while the poisoned points evade the defenses, underscoring the need for more robust, possibly provable, defenses against coordinated data poisoning. The study emphasizes that defenses must anticipate attackers who optimize against them, and it highlights directions such as stronger outlier detection, robust estimation, and strategies that rely on trusted data.
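
The randomized rounding mentioned above can be illustrated with a minimal sketch. It assumes the common unbiased rule (round up with probability equal to the fractional part), which may differ in detail from the paper's procedure; the names `randomized_round` and `expected_square` are hypothetical. The sketch also shows why rounding matters for a norm-based defense: the expected squared value after rounding exceeds $x^2$ by $p(1-p)$, where $p$ is the fractional part of $x$.

```python
import math
import random


def randomized_round(x: float, rng: random.Random) -> int:
    """Round x down or up so that the result equals x in expectation.

    Assumed rule: round up with probability equal to the fractional part.
    """
    lo = math.floor(x)
    frac = x - lo
    return lo + (1 if rng.random() < frac else 0)


def expected_square(x: float) -> float:
    """Closed-form E[x_hat^2] under the rounding rule above."""
    lo = math.floor(x)
    frac = x - lo
    return (1 - frac) * lo**2 + frac * (lo + 1) ** 2


rng = random.Random(0)
x = 2.3
samples = [randomized_round(x, rng) for _ in range(100_000)]

mean = sum(samples) / len(samples)                    # close to x
mean_sq = sum(s * s for s in samples) / len(samples)  # close to expected_square(x)

print(f"empirical mean       = {mean:.3f}")
print(f"empirical E[x_hat^2] = {mean_sq:.3f}")
print(f"closed-form E[x_hat^2] = {expected_square(x):.3f}, x^2 = {x * x:.3f}")
```

Because the rounded point has a larger expected squared norm than the continuous one, an attacker who must stay inside a norm-based feasible set would plausibly need to leave some slack before rounding.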

Abstract

Machine learning models trained on data from the outside world can be corrupted by data poisoning attacks that inject malicious points into the models' training sets. A common defense against these attacks is data sanitization: first filter out anomalous training points before training the model. In this paper, we develop three attacks that can bypass a broad range of common data sanitization defenses, including anomaly detectors based on nearest neighbors, training loss, and singular-value decomposition. By adding just 3% poisoned data, our attacks successfully increase test error on the Enron spam detection dataset from 3% to 24% and on the IMDB sentiment classification dataset from 12% to 29%. In contrast, existing attacks which do not explicitly account for these data sanitization defenses are defeated by them. Our attacks are based on two ideas: (i) we coordinate our attacks to place poisoned points near one another, and (ii) we formulate each attack as a constrained optimization problem, with constraints designed to ensure that the poisoned points evade detection. As this optimization involves solving an expensive bilevel problem, our three attacks correspond to different ways of approximating this problem, based on influence functions; minimax duality; and the Karush-Kuhn-Tucker (KKT) conditions. Our results underscore the need to develop more robust defenses against data poisoning attacks.
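
The constrained bilevel problem described in the abstract can be written schematically as follows. This is a sketch using the symbols that appear in the figure captions ($\hat{\theta}$, $\mathcal{D}_\text{c}$, $\mathcal{D}_\text{p}$, $\mathcal{L}$, $\mathcal{F}$, $\epsilon$); the training-loss symbol $\ell_{\text{train}}$ is assumed notation introduced here for illustration, and the paper's exact statement may differ.

$$
\begin{aligned}
\max_{\mathcal{D}_\text{p}} \quad & \mathcal{L}(\hat{\theta}) \\
\text{s.t.} \quad & |\mathcal{D}_\text{p}| = \epsilon \, |\mathcal{D}_\text{c}|, \qquad \mathcal{D}_\text{p} \subseteq \mathcal{F}, \\
& \hat{\theta} = \operatorname*{arg\,min}_{\theta} \; \ell_{\text{train}}(\theta;\, \mathcal{D}_\text{c} \cup \mathcal{D}_\text{p}).
\end{aligned}
$$

The outer problem is the attacker's: choose poisoned points that maximize test loss. The inner problem is the defender's training procedure. The constraint $\mathcal{D}_\text{p} \subseteq \mathcal{F}$ encodes evasion of the sanitization defense, and the three attacks correspond to different ways of approximating the dependence of $\hat{\theta}$ on $\mathcal{D}_\text{p}$.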

Paper Structure

This paper contains 62 sections, 8 theorems, 44 equations, 13 figures, 2 tables, 5 algorithms.

Key Result

Theorem 1

Consider a defender that learns a 2-class SVM or logistic regression model by first discarding all points outside a fixed feasible set $\mathcal{F}$ and then minimizing the average (regularized) training loss. Suppose that for each class $y = -1, +1$, the feasible set $\mathcal{F}_{y}$ …

Figures (13)

  • Figure 1: Left: In the absence of any poisoned data, the defender can often learn model parameters $\hat{\theta}$ that fit the true data $\mathcal{D}_\text{c}$ well. Here, we show the decision boundary learned by a linear support vector machine on synthetic data. Middle: However, the addition of poisoned data $\mathcal{D}_\text{p}$ can significantly change the learned $\hat{\theta}$, leading to high test error $\mathcal{L}(\hat{\theta})$. Right: By discarding outliers from $\mathcal{D} = \mathcal{D}_\text{c} \cup \mathcal{D}_\text{p}$ and then training on the remaining $\mathcal{D}_{\text{san}}$, the defender can mitigate the effectiveness of the attacker. In this example, the defender discards all blue points outside the blue ellipse, and all red points outside the red ellipse.
  • Figure 2: Plot of $\mathbb{E}[\|\hat{x}\|_2^2] = f(x)$ against $x$ for scalar $x$.
  • Figure 3: The KKT and min-max attacks give slightly higher test error than the influence attack on the Enron dataset. Moreover, they are more computationally efficient, and can be run on the larger and higher-dimensional IMDB dataset.
  • Figure 4: The test error achieved by the different attacks (against the L2 defense), vs. the number of minutes taken to generate the attacks. Each step increase in test error represents the processing of one choice of decoy parameters (for the KKT and min-max attacks) or 10 gradient steps (for the influence attack).
  • Figure 5: Iteratively updating the feasible set $\mathcal{F}_\beta$ increases test error by a few percentage points on the Enron dataset (with $\epsilon = 3\%$ poisoned data), compared to fixing the feasible set based on just the clean data $\mathcal{D}_\text{c}$.
  • ...and 8 more figures
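
The sanitization step described in the Figure 1 caption (discard outliers from $\mathcal{D}_\text{c} \cup \mathcal{D}_\text{p}$, then train on what remains) can be sketched as a simple centroid-distance filter. This is an illustration under stated assumptions, not the paper's implementation: the function `l2_sanitize`, the `keep_fraction` threshold, and the synthetic data are all hypothetical. The demo compares 3% poisoned points placed far from the data with the same number of points concentrated at one location inside the data cloud.

```python
import numpy as np


def l2_sanitize(X, y, keep_fraction=0.9):
    """Per class, keep the keep_fraction of points closest to the class centroid.

    The centroid and radius are estimated from the (possibly poisoned) data
    passed in; keep_fraction is an illustrative threshold, not the paper's.
    Returns a boolean mask over the rows of X.
    """
    keep = np.zeros(len(y), dtype=bool)
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        centroid = X[idx].mean(axis=0)
        dist = np.linalg.norm(X[idx] - centroid, axis=1)
        radius = np.quantile(dist, keep_fraction)
        keep[idx[dist <= radius]] = True
    return keep


# Synthetic clean data: two Gaussian classes in 2D.
rng = np.random.default_rng(0)
n = 200
X_clean = np.vstack([rng.normal(-2.0, 1.0, (n, 2)), rng.normal(2.0, 1.0, (n, 2))])
y_clean = np.array([-1] * n + [1] * n)

n_poison = 12  # 3% of the 400 clean points
y_poison = np.full(n_poison, -1)


def poison_survivors(location):
    """Place all poisoned points at one location; count how many pass the filter."""
    X_poison = np.tile(np.asarray(location, dtype=float), (n_poison, 1))
    X = np.vstack([X_clean, X_poison])
    y = np.concatenate([y_clean, y_poison])
    keep = l2_sanitize(X, y)
    return int(keep[-n_poison:].sum())


far_kept = poison_survivors([30.0, 30.0])    # obvious outliers
near_kept = poison_survivors([-1.0, -1.0])   # concentrated inside the class cloud
print(f"far poisoned points kept:  {far_kept} of {n_poison}")
print(f"near poisoned points kept: {near_kept} of {n_poison}")
```

In this toy setting the far-away points are filtered out while the concentrated points survive, which mirrors the paper's point that attacks must be constrained to the defender's feasible set to be effective.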

Theorems & Definitions (11)

  • Theorem 1: 2 points suffice for 2-class SVMs and logistic regression
  • Remark 2
  • Definition 1
  • Proposition 1
  • Remark 3
  • Lemma 1
  • Proposition 2: Carathéodory number of $\mathcal{G}$ for a 2-class SVM
  • Proposition 3: Carathéodory number of $\mathcal{G}$ for margin-based losses
  • Corollary 1: Carathéodory number of $\mathcal{G}$ for logistic regression
  • Theorem 1 (restated): 2 points suffice for 2-class SVMs and logistic regression
  • ...and 1 more