Keeping up with dynamic attackers: Certifying robustness to adaptive online data poisoning

Avinandan Bose, Laurent Lessard, Maryam Fazel, Krishnamurthy Dj Dvijotham

TL;DR

The paper addresses robustness of online learning systems to adaptive, dynamic data poisoning by adversaries who observe and react to the learning process. It introduces a general certificate framework that reduces the worst-case impact of online poisoning to a finite-dimensional optimization problem via duality, accommodating arbitrary learning algorithms. The framework is instantiated for mean estimation and linear binary classification, with a meta-learning scheme to tune defense parameters by trading off nominal performance and robustness bounds, and is validated on synthetic and real datasets including image and reward-learning tasks. This work provides a principled, scalable approach to certify robustness in online, feedback-driven settings such as RLHF, guiding the design of defenses for continuously updated models.

Abstract

The rise of foundation models fine-tuned on human feedback from potentially untrusted users has increased the risk of adversarial data poisoning, necessitating the study of robustness of learning algorithms against such attacks. Existing research on provable certified robustness against data poisoning attacks primarily focuses on certifying robustness for static adversaries who modify a fraction of the dataset used to train the model before the training algorithm is applied. In practice, particularly when learning from human feedback in an online sense, adversaries can observe and react to the learning process and inject poisoned samples that optimize adversarial objectives better than when they are restricted to poisoning a static dataset once, before the learning algorithm is applied. Indeed, it has been shown in prior work that online dynamic adversaries can be significantly more powerful than static ones. We present a novel framework for computing certified bounds on the impact of dynamic poisoning, and use these certificates to design robust learning algorithms. We give an illustration of the framework for the mean estimation and binary classification problems and outline directions for extending this in further work. The code to implement our certificates and replicate our results is available at https://github.com/Avinandan22/Certified-Robustness.
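The interaction described above, in which an adversary observes the learner's state and reacts to it, can be sketched on a toy problem. The block below is an illustrative assumption, not the paper's certificate or defense: it runs a 1-D online mean estimator with a simple step-size update. A hypothetical adversary replaces a fraction `eps` of samples with a point at distance `radius` from the true mean, chosen after looking at the current estimate. The function name, update rule, and all parameter values are made up for illustration.

```python
import random

def run_online_mean(T, eta, eps, radius, true_mean=0.0, seed=0):
    """Toy 1-D online mean estimation with an adaptive poisoner.

    Returns the time-averaged absolute error |theta_t - true_mean|.
    All modelling choices here are illustrative assumptions.
    """
    rng = random.Random(seed)
    theta = true_mean  # start at the truth so any drift comes from the data
    total_err = 0.0
    for _ in range(T):
        benign = rng.gauss(true_mean, 1.0)
        if rng.random() < eps:
            # Dynamic adversary: observes the current theta and pushes it
            # further from the true mean, within the allowed radius.
            direction = 1.0 if theta >= true_mean else -1.0
            z = true_mean + direction * radius
        else:
            z = benign
        theta += eta * (z - theta)  # online update with step size eta
        total_err += abs(theta - true_mean)
    return total_err / T

clean_err = run_online_mean(T=2000, eta=0.05, eps=0.0, radius=10.0)
poisoned_err = run_online_mean(T=2000, eta=0.05, eps=0.2, radius=10.0)
print(f"clean: {clean_err:.3f}, poisoned: {poisoned_err:.3f}")
```

Any single attack like this one gives only a lower bound on what the optimal adversary can achieve. The certificates the paper describes bound the same quantity from above.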

Paper Structure

This paper contains 29 sections, 10 theorems, 56 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

For any function $\lambda: \Theta \to \mathbb{R}$ and any dynamic adaptive adversary, the average loss (eq:avg_cost) is bounded above by

Figures (4)

  • Figure 1: A schematic diagram to highlight the differences between static and dynamic poisoning.
  • Figure 2: Test performance (mean squared error between true and estimated means) as the learning rate (above) and the fraction of samples corrupted by the dynamic adversary (below) are varied. Our defense significantly outperforms training without defense.
  • Figure 3: We plot the certificates of robustness for various settings (hyperparameter values), which act as upper bounds on the optimal dynamic adversary's objective. We also plot the test losses on the adversarial objective for various attacks, which act as lower bounds on the objective of the optimal adversary.
  • Figure 4: A poor choice of hyperparameters can make the learning algorithm vulnerable to dynamic attackers, as indicated by our certificates and attacks (red plots). Lower certificate values indicate more robust learning algorithms (blue, orange, green plots).

Theorems & Definitions (23)

  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Remark 3.1
  • Remark 3.2
  • Theorem 3
  • proof
  • Theorem 3
  • proof
  • ...and 13 more