Robustness-Congruent Adversarial Training for Secure Machine Learning Model Updates

Daniele Angioni, Luca Demetrio, Maura Pintor, Luca Oneto, Davide Anguita, Battista Biggio, Fabio Roli

TL;DR

This work shows that updating a robust machine-learning model can cause a regression in robustness, not just in accuracy, through robustness negative flips (RNFs): inputs for which no adversarial example was found against the old model, but one is found against the new model. It introduces robustness-congruent adversarial training (RCAT), an update method that combines adversarial training with a non-regression penalty to preserve robustness on the samples for which no adversarial example was found before the update. Theoretical results establish that learning under non-regression constraints yields consistent estimators, and experiments on CIFAR-10 and ImageNet show that RCAT achieves better trade-offs than PCT and PCAT, reducing both negative flips and RNFs. The approach offers a principled route to secure, backward-compatible model updates in vision tasks, and potentially in other domains with evolving data and robustness requirements.
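The two regression measures mentioned above can be illustrated with a minimal sketch. It assumes that per-sample outcomes are already available as boolean arrays: whether each model classifies the clean sample correctly, and whether an attack failed to find an adversarial example against it. The function name `flip_rate` and the toy arrays are hypothetical and are not taken from the paper.

```python
import numpy as np

def flip_rate(old_ok, new_ok):
    """Fraction of samples that were fine for the old model but not for the new one.

    With clean-correctness flags this gives the negative-flip (NF) rate;
    with "no adversarial example found" flags it gives the RNF rate.
    """
    old_ok = np.asarray(old_ok, dtype=bool)
    new_ok = np.asarray(new_ok, dtype=bool)
    if old_ok.shape != new_ok.shape:
        raise ValueError("old and new flags must cover the same samples")
    return float(np.mean(old_ok & ~new_ok))

# Toy outcomes for 8 test samples (illustrative values only).
old_correct = [1, 1, 1, 0, 1, 1, 0, 1]   # old model: 25% test error
new_correct = [1, 1, 0, 1, 1, 1, 1, 1]   # new model: 12.5% test error
old_robust  = [1, 0, 1, 0, 1, 0, 0, 1]   # old model: 50% robust error
new_robust  = [1, 1, 0, 0, 1, 1, 1, 0]   # new model: 37.5% robust error

nf_rate = flip_rate(old_correct, new_correct)
rnf_rate = flip_rate(old_robust, new_robust)
print(nf_rate)   # 0.125
print(rnf_rate)  # 0.25
```

In this toy case the new model improves both average accuracy and average robustness, yet it still regresses on individual samples, which is the situation the summary describes.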

Abstract

Machine-learning models demand periodic updates to improve their average accuracy, exploiting novel architectures and additional data. However, a newly updated model may commit mistakes the previous model did not make. Such misclassifications are referred to as negative flips, experienced by users as a regression of performance. In this work, we show that this problem also affects robustness to adversarial examples, hindering the development of secure model update practices. In particular, when updating a model to improve its adversarial robustness, previously ineffective adversarial attacks on some inputs may become successful, causing a regression in the perceived security of the system. We propose a novel technique, named robustness-congruent adversarial training, to address this issue. It amounts to fine-tuning a model with adversarial training, while constraining it to retain higher robustness on the samples for which no adversarial example was found before the update. We show that our algorithm and, more generally, learning with non-regression constraints, provides a theoretically-grounded framework to train consistent estimators. Our experiments on robust models for computer vision confirm that both accuracy and robustness, even if improved after model update, can be affected by negative flips, and our robustness-congruent adversarial training can mitigate the problem, outperforming competing baseline methods.
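As a rough illustration of the idea of adversarial training with a non-regression constraint, the sketch below adds an extra penalty on the adversarial loss for samples on which the old model was robust. This is only one simple way to express such a penalty and is not the paper's exact objective: the function names, the weighting scheme, and the parameter `lam` are assumptions, and the logits of the new model on adversarial inputs are taken as given rather than computed by an attack.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Per-sample cross-entropy from raw logits (numerically stable log-softmax)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def penalized_adversarial_loss(new_logits_adv, labels, old_robust, lam=1.0):
    """Adversarial loss plus an illustrative non-regression penalty.

    new_logits_adv: new-model logits on adversarially perturbed inputs.
    old_robust:     flags, True where no adversarial example was found
                    against the old model (assumed precomputed).
    lam:            hypothetical penalty weight; lam=0 recovers the plain
                    adversarial-training loss.
    """
    ce = cross_entropy(new_logits_adv, labels)
    mask = np.asarray(old_robust, dtype=float)
    return float(ce.mean() + lam * (mask * ce).mean())

# Two samples with true label 0; the new model is fooled on the second one.
logits_adv = [[2.0, 0.0], [0.0, 2.0]]
labels = [0, 0]

loss_regression = penalized_adversarial_loss(logits_adv, labels, old_robust=[0, 1])
loss_no_regression = penalized_adversarial_loss(logits_adv, labels, old_robust=[1, 0])
print(loss_regression > loss_no_regression)  # True
```

The same mistake costs more when it falls on a sample the old model handled robustly, so minimizing this loss pushes the update away from robustness negative flips.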

Paper Structure

This paper contains 19 sections, 1 theorem, 24 equations, 2 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Let $\mathcal{F}$ be learnable in the sense of Eq. (eq:eq_leanable), and let $\hat{f}$ and $f^*$ be defined as in Eqs. (eq:erm_nr) and (eq:rm_nr), respectively. Then the consistency result of Eq. (eq:consistency) holds.

Figures (2)

  • Figure 1: Regression modes in machine-learning model updates. Left: Regression of accuracy induced by negative flips (NFs). When updating an old model $f^{\rm old}$ (dashed black line) with a new model $f^{\rm new}$ (solid black line), a test sample $\boldsymbol{x}$ that was correctly classified by $f^{\rm old}$ may be misclassified by $f^{\rm new}$, causing an NF. Right: Regression of robustness induced by robustness negative flips (RNFs). In a different setting, the test sample $\boldsymbol{x}$ may still be correctly classified by $f^{\rm new}$. However, while no adversarial examples are found against $f^{\rm old}$ (since the perturbation domain $\mathcal{B}(\boldsymbol{x})$, represented by the dashed gray box around $\boldsymbol{x}$, never intersects the decision boundary of $f^{\rm old}$), an adversarial example $\boldsymbol{x}^\prime$ is found against $f^{\rm new}$, causing an RNF.
  • Figure 2: Regression of robustness and accuracy for the CIFAR-10 robust models $C_1, \ldots, C_7$ (Engstrom2019Robustness, Zhang2020Attacks, Rice2020Overfitting, Rade2021Helper, Hendrycks2019Using, Addepalli2021Towards, Carmon2019Unlabeled). (a) Left: Robust error (%) of each model, sorted in descending order. Right: RNFs (%) attained when replacing old models (in rows) with new ones (in columns). (b) Left: Test error (%) of each model, sorted in descending order. Right: NFs (%) attained when replacing old models (in rows) with new ones (in columns). Values in the upper (lower) triangular matrices evaluate regression when the new model has better (worse) average robustness/accuracy than the old model.
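
The old-versus-new matrices described in the Figure 2 caption can be computed along the following lines. This is a sketch under assumptions: the robustness flags are toy values rather than results for $C_1, \ldots, C_7$, and the function name `pairwise_flip_matrix` is hypothetical.

```python
import numpy as np

def pairwise_flip_matrix(ok):
    """Entry [i, j] is the fraction of samples that are fine for model i
    (the old model, in rows) but not for model j (the new model, in columns)."""
    ok = np.asarray(ok, dtype=bool)            # shape: (n_models, n_samples)
    old = ok[:, None, :]                       # broadcast over candidate new models
    new = ok[None, :, :]                       # broadcast over candidate old models
    return (old & ~new).mean(axis=-1)

# Robustness flags for 3 toy models on 4 samples, with models ordered by
# non-increasing robust error (50%, 50%, 25%), as in the caption.
robust = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
rnf_matrix = pairwise_flip_matrix(robust)
print(rnf_matrix)
```

Passing clean-correctness flags instead of robustness flags gives the corresponding NF matrix. The diagonal is always zero, since a model cannot regress with respect to itself.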

Theorems & Definitions (2)

  • Theorem 1
  • proof