Adversarial Training for Defense Against Label Poisoning Attacks

Melis Ilayda Bal, Volkan Cevher, Michael Muehlebach

TL;DR

This work tackles the vulnerability of predictive models to label poisoning by introducing Floral, a kernel SVM–based adversarial training defense formulated as a non-zero-sum Stackelberg game between an attacker and the learner. Floral solves a bilevel optimization with a PGD-based algorithm, focusing on adversarial updates to the labels of influential training points to robustify the decision boundary. The authors provide a local stability analysis and demonstrate through Moon, MNIST, and IMDB experiments that Floral achieves higher robust accuracy than robust baselines and even RoBERTa–based systems, while maintaining competitive clean accuracy. The approach is adaptable to multi-class settings and neural networks, and generalizes to several label-poisoning attacks, highlighting its practical significance for robust deployment in adversarial environments.

Abstract

As machine learning models grow in complexity and increasingly rely on publicly sourced data, such as the human-annotated labels used in training large language models, they become more vulnerable to label poisoning attacks. These attacks, in which adversaries subtly alter the labels within a training dataset, can severely degrade model performance, posing significant risks in critical applications. In this paper, we propose FLORAL, a novel adversarial training defense strategy based on support vector machines (SVMs) to counter these threats. Utilizing a bilevel optimization framework, we cast the training process as a non-zero-sum Stackelberg game between an attacker, who strategically poisons critical training labels, and the model, which seeks to recover from such attacks. Our approach accommodates various model architectures and employs a projected gradient descent algorithm with kernel SVMs for adversarial training. We provide a theoretical analysis of our algorithm's convergence properties and empirically evaluate FLORAL's effectiveness across diverse classification tasks. Compared to robust baselines and foundation models such as RoBERTa, FLORAL consistently achieves higher robust accuracy under increasing attacker budgets. These results underscore the potential of FLORAL to enhance the resilience of machine learning models against label poisoning threats, thereby ensuring robust classification in adversarial settings.
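
As a rough illustration of the attacker described in the abstract, the following sketch flips the labels of the highest-scoring training points under a fixed budget. The function name `flip_most_influential`, the `scores` input, and the top-k selection rule are assumptions made for illustration; the paper's actual attack is defined through its bilevel formulation, which is not reproduced here.

```python
from typing import List, Sequence


def flip_most_influential(labels: Sequence[int],
                          scores: Sequence[float],
                          budget: int) -> List[int]:
    """Return a copy of `labels` with the `budget` highest-scoring labels flipped.

    Assumptions (not taken from the paper): labels are in {-1, +1}, and
    `scores` is a hypothetical per-point influence score where a larger
    value means a more influential training point.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Rank indices by descending score; sorted() is stable, so ties keep
    # their original order.
    ranked = sorted(range(len(labels)), key=lambda i: -scores[i])
    poisoned = list(labels)
    for i in ranked[:budget]:
        poisoned[i] = -poisoned[i]  # flip the sign of the selected label
    return poisoned


if __name__ == "__main__":
    y = [1, -1, 1, 1, -1]
    influence = [0.1, 2.0, 0.0, 1.5, 0.3]
    # Indices 1 and 3 have the two largest scores, so their labels flip.
    print(flip_most_influential(y, influence, budget=2))  # [1, 1, 1, -1, -1]
```

For a kernel SVM, one plausible choice for `scores` would be the dual variables of the trained model, since support vectors shape the decision boundary; that choice is likewise an assumption of this sketch.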

Paper Structure

This paper contains 52 sections, 5 theorems, 46 equations, 18 figures, 14 tables.

Key Result

Lemma 1

Let $(\hat{\lambda}, \hat{y}(\hat{\lambda}))$ denote a Stackelberg equilibrium, i.e., $\hat{y}(\hat{\lambda}) := \textsc{LFlip}(\hat{\lambda})$ and $\hat{\lambda} := \textsc{Prox}_{\mathcal{S}(\hat{y}(\hat{\lambda}))}(\hat{z}) = \textsc{Prox}_{\mathcal{S}(\hat{y}(\hat{\lambda}))}(\hat{\lambda} - \ldots)$ … where $\kappa_{y}$ is a constant defined by the Prox operator and index set corresponding to $\lambda\ldots$
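
The equilibrium condition above applies a Prox operator to a shifted copy of the dual variables. The excerpt is truncated, so the sketch below rests on stated assumptions rather than on the paper's definitions: it takes the shift to be a gradient step on the standard kernel SVM dual objective $\tfrac{1}{2}\lambda^\top Q \lambda - \mathbf{1}^\top \lambda$ with $Q_{ij} = y_i y_j K_{ij}$, and it takes the Prox operator to be the Euclidean projection onto the usual dual feasible set $\{\lambda : 0 \le \lambda_i \le C,\ \sum_i y_i \lambda_i = 0\}$. The names `project_onto_dual_set`, `dual_gradient`, and `pgd_step` are hypothetical.

```python
from typing import List, Sequence


def project_onto_dual_set(z: Sequence[float], y: Sequence[int],
                          C: float, iters: int = 100) -> List[float]:
    """Euclidean projection of z onto {lam: 0 <= lam_i <= C, sum_i y_i*lam_i = 0}.

    The projection has the form lam_i = clip(z_i - nu*y_i, 0, C); the scalar
    nu is found by bisection, because sum_i y_i*lam_i is non-increasing in nu
    when every y_i is in {-1, +1}.
    """
    def clipped(nu: float) -> List[float]:
        return [min(max(zi - nu * yi, 0.0), C) for zi, yi in zip(z, y)]

    def balance(nu: float) -> float:
        return sum(yi * li for yi, li in zip(y, clipped(nu)))

    bound = max(abs(zi) for zi in z) + C  # balance(-bound) >= 0 >= balance(bound)
    lo, hi = -bound, bound
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if balance(mid) > 0.0:
            lo = mid  # constraint sum still positive: increase nu
        else:
            hi = mid
    return clipped(0.5 * (lo + hi))


def dual_gradient(lam: Sequence[float], y: Sequence[int],
                  K: Sequence[Sequence[float]]) -> List[float]:
    """Gradient of 0.5 * lam^T Q lam - sum(lam), with Q_ij = y_i * y_j * K_ij."""
    n = len(lam)
    return [sum(y[i] * y[j] * K[i][j] * lam[j] for j in range(n)) - 1.0
            for i in range(n)]


def pgd_step(lam: Sequence[float], y: Sequence[int],
             K: Sequence[Sequence[float]], C: float, eta: float) -> List[float]:
    """One projected gradient step: gradient move, then projection (assumed Prox)."""
    grad = dual_gradient(lam, y, K)
    z = [li - eta * gi for li, gi in zip(lam, grad)]
    return project_onto_dual_set(z, y, C)


if __name__ == "__main__":
    K = [[1.0, 0.0], [0.0, 1.0]]  # toy kernel matrix (identity)
    y = [1, -1]
    print(pgd_step([0.0, 0.0], y, K, C=1.0, eta=0.5))
```

In an adversarial training loop of the kind the lemma describes, the labels `y` passed to `pgd_step` would be the attacker's current flipped labels, so the feasible set changes whenever the labels change.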

Figures (18)

  • Figure 1: (a): The illustration of Floral defense, adversarial training under label poisoning attacks. (b): The test accuracy degradation of RoBERTa fine-tuned on the IMDB dataset with adversarial labels, showing its vulnerability to such attacks. (c): Floral effectively mitigates the impact of label poisoning in (b), achieving significantly higher robust accuracy.
  • Figure 2: Sensitivity of the decision boundary to label poisoning attacks. The vulnerability of data points differs between feature perturbation and label poisoning attacks. Given a perfect classifier, points near the decision boundary are less robust to feature attacks [gairatexplore-exploit-db-dynamics], leading to localized shifts in classification regions when the attack is performed. In contrast, the decision boundary has a broader sensitivity with respect to label poisoning attacks, which can affect both near-boundary and distant points. By injecting incorrect labels, these attacks can create more widespread disruption and an overall degradation in classifier performance across the input space.
  • Figure 3: Test accuracy of methods on the Moon dataset under varying label poisoning levels. For SVM models, $C=10$, $\gamma=1$ are used. See Appendix \ref{app:additional-experiment-results} (Figure \ref{fig:moon-exp-results-plots-appendix}) for results with other settings. As the label poisoning level increases, the accuracy of methods generally declines; however, Floral maintains higher robust accuracy across all adversarial settings, without compromising clean accuracy.
  • Figure 4: The decision boundaries on the Moon test dataset under varying label poisoning levels. SVM models use an RBF kernel with $C=10$ and $\gamma=0.5$. Floral generates a smooth decision boundary compared to baseline methods, which show drastic changes due to adversarial training label manipulations. For the complete results with other baselines, see Appendix \ref{app:additional-experiment-results} (Figure \ref{fig:moon-decision-boundaries-app-C10-gamma0.5}).
  • Figure 5: Illustrations of the Moon training sets from an example replication, using clean and adversarial labels with poisoning levels: $5\%$, $10\%$, $25\%$.
  • ...and 13 more figures

Theorems & Definitions (13)

  • Lemma 1
  • proof : Proof
  • Lemma 2
  • proof : Proof
  • Theorem 3.1: $\varepsilon$-local asymptotic stability
  • proof : Proof (sketch)
  • proof
  • Definition 1: Prox operator
  • Lemma 3: Bounded iterates
  • proof
  • ...and 3 more