
Adapting to Evolving Adversaries with Regularized Continual Robust Training

Sihui Dai, Christian Cianfarani, Arjun Bhagoji, Vikash Sehwag, Prateek Mittal

TL;DR

This work tackles the problem of defending ML models against evolving, time-delayed adversaries by formulating Continual Adaptive Robustness (CAR) and proposing Regularized Continual Robust Training (RCRT). The core idea is to combine robust pre-training with iterative robust fine-tuning while regularizing representations in logit space to limit robustness degradation across attacks. The authors prove bounds linking cross-attack robustness gaps to logit perturbations and show that logit-space regularization can reduce forgetting and improve performance on unseen attacks. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNette across over 100 attack combinations demonstrate that RCRT improves robust accuracy with modest training-time overhead, offering a practical path toward deploying models robust to evolving threats.

Abstract

Robust training methods typically defend against specific attack types, such as Lp attacks with fixed budgets, and rarely account for the fact that defenders may encounter new attacks over time. A natural solution is to adapt the defended model to new adversaries as they arise via fine-tuning, a method which we call continual robust training (CRT). However, when implemented naively, fine-tuning on new attacks degrades robustness on previous attacks. This raises the question: how can we improve the initial training and fine-tuning of the model to simultaneously achieve robustness against previous and new attacks? We present theoretical results which show that the gap in a model's robustness against different attacks is bounded by how far each attack perturbs a sample in the model's logit space, suggesting that regularizing with respect to this logit space distance can help maintain robustness against previous attacks. Extensive experiments on 3 datasets (CIFAR-10, CIFAR-100, and ImageNette) and over 100 attack combinations demonstrate that the proposed regularization improves robust accuracy with little overhead in training time. Our findings and open-source code lay the groundwork for the deployment of models robust to evolving attacks.
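The regularized objective described above can be sketched per sample. It is an adversarial cross-entropy term plus a penalty on the $\ell_2$ distance between the logits of a clean input and its attacked copy. This is a minimal illustration, not the authors' implementation. It assumes a fixed, already-computed perturbation and a tunable weight `reg_weight` (λ). The function names are hypothetical. The paper's regularizers (adversarial $\ell_2$ and variation regularization) instead search for a worst-case perturbation.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy for one sample, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def logit_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (l2) distance between two logit vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def regularized_loss(clean_logits: Sequence[float],
                     adv_logits: Sequence[float],
                     label: int,
                     reg_weight: float = 1.0) -> float:
    """Hypothetical per-sample objective: adversarial CE + lambda * logit distance.

    clean_logits: model output h(x) on the clean input
    adv_logits:   model output h(x + delta) on the attacked input
    reg_weight:   assumed regularization strength (lambda)
    """
    adv_term = cross_entropy(adv_logits, label)
    reg_term = logit_distance(adv_logits, clean_logits)
    return adv_term + reg_weight * reg_term
```

Setting `reg_weight=0` recovers plain adversarial training on the given attacked sample. Larger values push the attacked logits toward the clean logits, which is the quantity the theory ties to cross-attack robustness gaps.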

Paper Structure

This paper contains 32 sections, 3 theorems, 26 equations, 11 figures, 18 tables.

Key Result

Theorem 4.1

Assume that the loss $\ell(\hat{y},y)$ is $M_1$-Lipschitz in $\|\cdot\|_2$ for $\hat{y} \in h(X)$, with $M_1 > 0$, and bounded by $M_2 > 0$. (We note that surrogate losses such as the cross-entropy used during training are not bounded, but the $0$-$1$ loss, which is often the key quantity of interest, is bounded.) … where $D = M_2\sqrt{\frac{\log(\rho/2)}{-2n}}$.
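The constant $D$ in Theorem 4.1 can be evaluated directly. This is a minimal sketch. It assumes $\rho \in (0,1)$ acts as the failure probability of a Hoeffding-style concentration bound and that $n$ is the number of samples; the truncated statement does not define either explicitly. Since $\log(\rho/2)/(-2n) = \log(2/\rho)/(2n)$, $D$ is positive and shrinks as $1/\sqrt{n}$. The function name is hypothetical.

```python
import math


def concentration_term(m2: float, rho: float, n: int) -> float:
    """D = M2 * sqrt(log(rho / 2) / (-2 n)), as written in Theorem 4.1.

    m2:  upper bound M_2 on the loss (e.g. 1 for the 0-1 loss)
    rho: assumed failure probability, must lie in (0, 1)
    n:   assumed number of samples
    """
    if m2 <= 0 or not (0 < rho < 1) or n <= 0:
        raise ValueError("need m2 > 0, 0 < rho < 1 and n > 0")
    # log(rho / 2) < 0, so dividing by -2n gives a positive radicand.
    return m2 * math.sqrt(math.log(rho / 2) / (-2 * n))
```

For example, with the $0$-$1$ loss ($M_2 = 1$), $\rho = 0.05$, and $n = 10{,}000$, this gives $D \approx 0.0136$. Quadrupling $n$ halves $D$.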

Figures (11)

  • Figure 1: An overview of the problem of adapting to new adversaries (Continual Adaptive Robustness) and our solution framework (Regularized Continual Robust Training). In this problem, the defender learns about new attacks sequentially and, at time $t$, aims to achieve robustness against all attacks seen at times $\le t$. The model is deployed at time $0$ to be robust against an initial set of attacks; new attacks are introduced at times $t_1$, $t_2$, and $t_3$. We propose performing initial robust training when the first attack (or set of attacks) becomes available and then using fine-tuning to adapt the model to future attacks within time $\Delta t$.
  • Figure 2: Ablation 2: Change in union robust accuracy after fine-tuning with regularization (the initial model does not use regularization). We fine-tune models on Imagenette across 144 pairs of initial attack and new attack. The initial attack corresponds to the row of each grid and the new attack to the column. Values are differences between the accuracy of a model fine-tuned with regularization and one fine-tuned without it. Gains in accuracy of at least 1% are highlighted in green, and drops in accuracy of at least 1% in red. Further results are in the fine-tuning appendix.
  • Figure 3: Adversarial loss gap ($\mathcal{L}_{1,2}(h) - \mathcal{L}(h)$) and average $\ell_2$ distance between the logits of samples attacked by $\ell_2$ ($\epsilon = 0.5$, representing $P_{C_1}$) and by StAdv ($\epsilon = 0.05$, representing $P_{C_2}$), tracked over 25 epochs of fine-tuning with the fine-tuning method of Croce and Hein (2022), both with and without regularization. Each model is fine-tuned from a model adversarially trained against an $\ell_2$ adversary, as described in the experimental setup section. In all training scenarios, the loss gap visibly correlates with the logit distance, in line with the corollary relating the loss gap to logit-space distance.
  • Figure 4: Change in robust accuracy after fine-tuning for models initially trained with adversarial $\ell_2$ regularization, across different pairs of initial and new attacks. We fine-tune models on Imagenette across 144 pairs of initial attack and new attack. The initial attack corresponds to the row of each grid and the new attack to the column. Values are differences between the accuracy of a fine-tuned model whose initial training used regularization and one whose initial training did not. Gains in accuracy of at least 1% are highlighted in green, and drops in accuracy of at least 1% in red.
  • Figure 5: Change in robust accuracy after fine-tuning for models initially trained with variation regularization, across different pairs of initial and new attacks. We fine-tune models on Imagenette across 144 pairs of initial attack and new attack. The initial attack corresponds to the row of each grid and the new attack to the column. Values are differences between the accuracy of a fine-tuned model whose initial training used regularization and one whose initial training did not. Gains in accuracy of at least 1% are highlighted in green, and drops in accuracy of at least 1% in red.
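The Continual Adaptive Robustness timeline in Figure 1 can be sketched as a small schedule. At deployment time $0$ the model covers the initial attacks. Each attack arriving at a later time $t_i$ should be handled by fine-tuning before $t_i + \Delta t$. At any time $t$, the target set is every attack seen at times $\le t$. This is a hypothetical data structure for illustration only; training itself is omitted, and the attack labels are placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AttackArrival:
    name: str    # illustrative label, e.g. "l2" or "StAdv"
    time: float  # time at which the defender learns of the attack


def attacks_to_defend(arrivals: List[AttackArrival], t: float) -> List[str]:
    """Attacks known at time t, i.e. the defender's target set at t."""
    ordered = sorted(arrivals, key=lambda a: a.time)
    return [a.name for a in ordered if a.time <= t]


def finetune_deadlines(arrivals: List[AttackArrival],
                       delta_t: float) -> List[Tuple[str, float]]:
    """Deadline (arrival time + delta_t) for adapting to each post-deployment attack.

    Attacks known at time 0 are covered by initial robust training, so they
    get no fine-tuning deadline.
    """
    ordered = sorted(arrivals, key=lambda a: a.time)
    return [(a.name, a.time + delta_t) for a in ordered if a.time > 0]
```

Keeping arrivals sorted by time makes the target set at any $t$ a prefix of the arrival list. This mirrors the CAR requirement that robustness covers all attacks seen so far.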

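The grid values in Figures 2, 4, and 5 can be sketched as follows. The sketch assumes that union robust accuracy counts a test sample as correct only if the model classifies it correctly under every attack considered (the usual worst-case definition). It also assumes differences are in percentage points. Function names are hypothetical; the 1% threshold comes from the figure captions.

```python
from typing import Dict, List


def union_robust_accuracy(correct: Dict[str, List[bool]]) -> float:
    """Percentage of samples classified correctly under every attack.

    correct maps an attack name to per-sample correctness flags,
    with all lists covering the same samples in the same order.
    """
    flags = list(correct.values())
    if not flags or not flags[0]:
        raise ValueError("need at least one attack and one sample")
    n = len(flags[0])
    if any(len(f) != n for f in flags):
        raise ValueError("all attacks must cover the same samples")
    robust = sum(all(f[i] for f in flags) for i in range(n))
    return 100.0 * robust / n


def cell_color(acc_with_reg: float, acc_without_reg: float,
               threshold: float = 1.0) -> str:
    """Highlight rule from the captions: gain >= 1 point green, drop >= 1 point red."""
    diff = acc_with_reg - acc_without_reg
    if diff >= threshold:
        return "green"
    if diff <= -threshold:
        return "red"
    return "none"
```

For a cell in these grids, `correct` would hold flags for the initial attack (row) and the new attack (column). The cell value is the difference between the regularized and unregularized runs.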
Theorems & Definitions (7)

  • Definition 3.1: Continual Adaptive Robustness (Dai et al., 2024)
  • Theorem 4.1
  • Corollary 4.2
  • Proof
  • Proof
  • Corollary 3.1
  • Proof