Certified Robustness to Data Poisoning in Gradient-Based Training

Philip Sosnin, Mark N. Müller, Maximilian Baader, Calvin Tsay, Matthew Wicker

TL;DR

This work develops the first framework providing provable guarantees on the behavior of models trained with potentially manipulated data without modifying the model or learning algorithm, and certifies robustness against untargeted and targeted poisoning, as well as backdoor attacks, for bounded and unbounded manipulations of the training inputs and labels.

Abstract

Modern machine learning pipelines leverage large amounts of public data, making it infeasible to guarantee data quality and leaving models open to poisoning and backdoor attacks. Provably bounding model behavior under such attacks remains an open problem. In this work, we address this challenge by developing the first framework providing provable guarantees on the behavior of models trained with potentially manipulated data without modifying the model or learning algorithm. In particular, our framework certifies robustness against untargeted and targeted poisoning, as well as backdoor attacks, for bounded and unbounded manipulations of the training inputs and labels. Our method leverages convex relaxations to over-approximate the set of all possible parameter updates for a given poisoning threat model, allowing us to bound the set of all reachable parameters for any gradient-based learning algorithm. Given this set of parameters, we provide bounds on worst-case behavior, including model performance and backdoor success rate. We demonstrate our approach on multiple real-world datasets from applications including energy consumption, medical imaging, and autonomous driving.
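To make the idea of bounding all reachable parameters concrete, here is a minimal sketch, not the authors' implementation. It assumes a scalar linear model $f(x) = w x$ with squared loss, full-batch gradient descent, and an adversary that may shift at most `m` labels by at most `nu` each. Plain interval arithmetic stands in for the convex relaxations mentioned above, and all function names (`grad_interval`, `certified_step`, `certify_training`, `train`) are hypothetical.

```python
def scale_interval(lo, hi, c):
    """Multiply the interval [lo, hi] by a known scalar c."""
    a, b = lo * c, hi * c
    return (a, b) if a <= b else (b, a)


def grad_interval(w_lo, w_hi, x, y_lo, y_hi):
    """Bounds on d/dw (w*x - y)^2 = 2*(w*x - y)*x for w and y in intervals."""
    p_lo, p_hi = scale_interval(w_lo, w_hi, x)   # bounds on w * x
    r_lo, r_hi = p_lo - y_hi, p_hi - y_lo        # bounds on the residual
    return scale_interval(r_lo, r_hi, 2.0 * x)


def certified_step(w_lo, w_hi, xs, ys, m, nu, lr):
    """One gradient step with sound bounds on the updated weight."""
    clean = [grad_interval(w_lo, w_hi, x, y, y) for x, y in zip(xs, ys)]
    pois = [grad_interval(w_lo, w_hi, x, y - nu, y + nu) for x, y in zip(xs, ys)]
    # At most m points are poisoned, so only the m largest widenings count.
    drop = sorted((c[0] - p[0] for c, p in zip(clean, pois)), reverse=True)[:m]
    rise = sorted((p[1] - c[1] for c, p in zip(clean, pois)), reverse=True)[:m]
    g_lo = (sum(c[0] for c in clean) - sum(drop)) / len(xs)
    g_hi = (sum(c[1] for c in clean) + sum(rise)) / len(xs)
    # w - lr * g is decreasing in g, so the gradient bounds swap roles.
    return w_lo - lr * g_hi, w_hi - lr * g_lo


def certify_training(xs, ys, m, nu, lr=0.1, steps=20, w0=0.0):
    """Interval [w_lo, w_hi] containing every weight the adversary can induce."""
    w_lo = w_hi = w0
    for _ in range(steps):
        w_lo, w_hi = certified_step(w_lo, w_hi, xs, ys, m, nu, lr)
    return w_lo, w_hi


def train(xs, ys, lr=0.1, steps=20, w0=0.0):
    """Ordinary gradient descent on one concrete (possibly poisoned) dataset."""
    w = w0
    for _ in range(steps):
        w -= lr * sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return w


xs = [0.1 * i for i in range(1, 11)]
ys = [2.0 * x for x in xs]
w_lo, w_hi = certify_training(xs, ys, m=2, nu=0.1)
print(w_lo, train(xs, ys), w_hi)
```

The training loop itself is unchanged; the bounds are computed alongside it. Because the interval update ignores the dependency between the weight and its gradient, the interval widens with every step, so this sketch is sound but loose.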

Paper Structure (34 sections, 6 theorems, 48 equations, 8 figures, 1 table, 1 algorithm)

Key Result

Theorem 3.1

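Given valid parameter bounds $[\theta^L, \theta^U]$ for an adversary $\mathcal{T}(\mathcal{D})$, one can compute a sound upper bound (i.e., certificate) on any poisoning objective by optimization over the parameter space, rather than dataset space:

$$\max_{\mathcal{D}' \in \mathcal{T}(\mathcal{D})} J\left(f^{M(f, \theta', \mathcal{D}')}\right) \leq \max_{\theta^\star \in [\theta^L, \theta^U]} J\left(f^{\theta^\star}\right),$$

where $M(f, \theta', \mathcal{D}')$ denotes the parameters obtained by training $f$ from initialization $\theta'$ on the poisoned dataset $\mathcal{D}'$, and $J$ is one of the poisoning objective functions (untargeted through backdoor). Full expressions are provided in the appendix.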
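As an illustration of certifying in parameter space, the sketch below bounds the mean squared error of a scalar linear model $f(x) = w x$ over every weight in a given interval. The model, the interval endpoints, and the function names (`output_interval`, `certified_mse`, `mse`) are assumptions made for this example and do not come from the paper.

```python
def output_interval(w_lo, w_hi, x):
    """Bounds on w * x for any w in [w_lo, w_hi]."""
    a, b = w_lo * x, w_hi * x
    return min(a, b), max(a, b)


def certified_mse(w_lo, w_hi, xs, ys):
    """Sound upper bound on the MSE of f(x) = w * x over all w in [w_lo, w_hi]."""
    total = 0.0
    for x, y in zip(xs, ys):
        lo, hi = output_interval(w_lo, w_hi, x)
        # Squared error is convex in the prediction, so its maximum over
        # an interval is attained at one of the two endpoints.
        total += max((lo - y) ** 2, (hi - y) ** 2)
    return total / len(xs)


def mse(w, xs, ys):
    """MSE of one concrete weight, for comparison with the certificate."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.0, 2.0, 3.0, 4.0]
print(certified_mse(1.8, 2.3, xs, ys), mse(2.0, xs, ys))
```

The worst case is taken independently per test point, so the result can be looser than a joint maximization over the interval, but it never falls below the true worst-case error. No search over poisoned datasets is needed once the parameter bounds are known.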

Figures (8)

  • Figure 1: Bounds on a classification threshold trained on the halfmoons dataset for a bounded adversary that can perturb up to $n$ data-points by up to $\epsilon$ in the $p=\infty$ norm; in label space, the adversary may flip up to $m$ labels (corresponding to $\gamma=1, q=0$). The white line shows the decision boundary of the nominal classification model. The coloured regions show the areas for which we cannot certify robustness for the given adversary strength.
  • Figure 2: Mean squared error bounds on the UCI-houseelectric dataset. Top: Effect of adversary strength on certified bounds for a fixed model. Bottom: Effect of model/training hyperparameters on certified bounds for a fixed feature poisoning adversary $(n=100, \epsilon=0.01, p=\infty)$. Unless stated otherwise, the hyperparameters are $d=1$, $w=50$, $b=10000$, and $\alpha=0.02$.
  • Figure 3: Certified accuracy on the MNIST dataset under a label-flipping attack.
  • Figure 4: Certified accuracy (left) and backdoor accuracy (right) for a binary classifier fine-tuned on the Drusen class of OCTMNIST for an attack size up to 10% poisoned data per batch ($b=6000, p=\infty, q=0, \nu=1$). Dashed lines show the nominal accuracy of each fine-tuned model.
  • Figure 5: Left: Fine-tuning PilotNet on unseen data with a bounded label poisoning attack ($q=\infty$). Right: Steering angle prediction bounds after fine-tuning ($m=300, q=\infty, \nu=0.01$).
  • ...and 3 more figures

Theorems & Definitions (9)

  • Definition 1
  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3: Bounding the descent direction for a bounded adversary
  • Proposition 1: Explicit upper bounds of neural network $f$ with interval parameters
  • Theorem B.1: Bounding the descent direction for an unbounded adversary
  • Definition 2: Interval Matrix Arithmetic
  • Definition 3: Interval Matrix Multiplication
  • Proposition 2: Explicit output bounds of neural network $f$ with interval parameters
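The interval matrix operations and network output bounds named above can be sketched as follows. This is one straightforward way to realize them, assuming elementwise interval products, a two-layer ReLU network without biases, and matrices stored as nested lists. The paper's exact definitions may differ, and the names `interval_matmul`, `relu_interval`, `network_bounds`, and `widen` are hypothetical.

```python
def interval_matmul(A_lo, A_hi, B_lo, B_hi):
    """Bounds containing A @ B for all A in [A_lo, A_hi] and B in [B_lo, B_hi]."""
    n, k, p = len(A_lo), len(B_lo), len(B_lo[0])
    C_lo = [[0.0] * p for _ in range(n)]
    C_hi = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for t in range(k):
                # The extremes of a product of two intervals lie at corners.
                prods = (A_lo[i][t] * B_lo[t][j], A_lo[i][t] * B_hi[t][j],
                         A_hi[i][t] * B_lo[t][j], A_hi[i][t] * B_hi[t][j])
                C_lo[i][j] += min(prods)
                C_hi[i][j] += max(prods)
    return C_lo, C_hi


def relu_interval(Z_lo, Z_hi):
    """ReLU is monotone, so it can be applied to each bound separately."""
    def relu(M):
        return [[max(0.0, v) for v in row] for row in M]
    return relu(Z_lo), relu(Z_hi)


def network_bounds(x, W1_lo, W1_hi, W2_lo, W2_hi):
    """Output bounds of f(x) = W2 @ relu(W1 @ x) with interval weights."""
    Z_lo, Z_hi = interval_matmul(W1_lo, W1_hi, x, x)  # x is a known input
    H_lo, H_hi = relu_interval(Z_lo, Z_hi)
    return interval_matmul(W2_lo, W2_hi, H_lo, H_hi)


def widen(M, r):
    """Hypothetical helper: the interval [M - r, M + r], elementwise."""
    return ([[v - r for v in row] for row in M],
            [[v + r for v in row] for row in M])


x = [[1.0], [-0.5]]                       # column vector input
W1_lo, W1_hi = widen([[1.0, -2.0], [0.5, 1.5]], 0.1)
W2_lo, W2_hi = widen([[1.0, -1.0]], 0.1)
y_lo, y_hi = network_bounds(x, W1_lo, W1_hi, W2_lo, W2_hi)
print(y_lo, y_hi)
```

Any concrete network whose weights lie inside the intervals produces an output inside `[y_lo, y_hi]`. The triple loop is written for clarity rather than speed; a vectorized version would be preferable at realistic network sizes.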