Game of Trojans: Adaptive Adversaries Against Output-based Trojaned-Model Detectors

Dinuka Sahabandu, Xiaojun Xu, Arezoo Rajabi, Luyao Niu, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran

TL;DR

The paper analyzes an adaptive adversary capable of retraining Trojaned DNNs with full knowledge of state-of-the-art output-based detectors and models their interaction as an iterative co-evolution game. It formalizes a min–max framework where the Trojaned model and detector are updated in alternation, proving that at equilibrium the Trojaned and clean output distributions become indistinguishable ($q_T = q_C$) and the adversary evades detection. A greedy, submodularity-based approach for selecting trigger-embedding samples is proposed, with a proven $1/2$-optimality guarantee for common losses, and empirical results across MNIST, CIFAR-10/100, and SpeechCommand show robust evasion of MNTD, NeuralCleanse, STRIP, and TABOR. The findings emphasize the vulnerability of current output-based detectors to adaptive adversaries and motivate the development of detectors capable of countering co-evolutionary backdoor attacks with stronger guarantees and broader threat models.

Abstract

We propose and analyze an adaptive adversary that can retrain a Trojaned DNN and is also aware of SOTA output-based Trojaned model detectors. We show that such an adversary can ensure (1) high accuracy on both trigger-embedded and clean samples and (2) bypass detection. Our approach is based on an observation that the high dimensionality of the DNN parameters provides sufficient degrees of freedom to simultaneously achieve these objectives. We also enable SOTA detectors to be adaptive by allowing retraining to recalibrate their parameters, thus modeling a co-evolution of parameters of a Trojaned model and detectors. We then show that this co-evolution can be modeled as an iterative game, and prove that the resulting (optimal) solution of this interactive game leads to the adversary successfully achieving the above objectives. In addition, we provide a greedy algorithm for the adversary to select a minimum number of input samples for embedding triggers. We show that for cross-entropy or log-likelihood loss functions used by the DNNs, the greedy algorithm provides provable guarantees on the needed number of trigger-embedded input samples. Extensive experiments on four diverse datasets -- MNIST, CIFAR-10, CIFAR-100, and SpeechCommand -- reveal that the adversary effectively evades four SOTA output-based Trojaned model detectors: MNTD, NeuralCleanse, STRIP, and TABOR.
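
The greedy selection step mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `greedy_select`, `coverage`, and `COVERS` are hypothetical, and the toy coverage objective stands in for the paper's loss-based objective, which is not reproduced here. The sketch only shows the general pattern of adding, one at a time, the sample with the largest marginal gain until a budget is reached or no candidate improves the objective.

```python
from typing import Callable, FrozenSet, Hashable, List, Sequence


def greedy_select(
    candidates: Sequence[Hashable],
    value_fn: Callable[[FrozenSet], float],
    budget: int,
    min_gain: float = 0.0,
) -> List[Hashable]:
    """Greedily add the candidate with the largest marginal gain in value_fn."""
    selected: List[Hashable] = []
    remaining = list(candidates)
    current = value_fn(frozenset())
    while remaining and len(selected) < budget:
        # Value of the selected set after adding each remaining candidate.
        scored = [(value_fn(frozenset(selected + [c])), c) for c in remaining]
        best_val = max(v for v, _ in scored)
        if best_val - current <= min_gain:
            break  # no candidate yields a gain above the threshold
        best = next(c for v, c in scored if v == best_val)  # first maximizer
        selected.append(best)
        remaining.remove(best)
        current = best_val
    return selected


# Toy stand-in objective (hypothetical, NOT the paper's loss): each sample
# "covers" some features, and a subset is scored by how many distinct
# features it covers. Coverage functions of this kind are submodular.
COVERS = {0: {"a", "b", "c"}, 1: {"c", "d"}, 2: {"a"}, 3: {"d", "e"}}


def coverage(subset: FrozenSet) -> float:
    covered = set()
    for i in subset:
        covered |= COVERS[i]
    return float(len(covered))


chosen = greedy_select(list(COVERS), coverage, budget=3)
print(chosen)  # [0, 3]
```

In this toy run the loop stops after two picks even though the budget is three, because neither remaining sample adds a new feature. This mirrors the idea of embedding triggers in only as many samples as are useful.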

Paper Structure

This paper contains 25 sections, 4 theorems, 14 equations, 4 figures, 3 tables, 2 algorithms.

Key Result

Proposition 4.1

For a random input data sample $x \in \mathscr{D}_{R}$, let $z_{T} := f_{\theta_T}(x)$ and $z_{C} := f_{\theta_C}(x)$ respectively denote the outputs of a Trojaned model and a clean model. Let $q_{T}$ and $q_{C}$ denote probability distributions associated with $z_{T}$ and $z_{C}$. Then, at the opti

Figures (4)

  • Figure 1: Two-step offline training of Trojaned DNN models. Step 1: the adversary uses the trigger to be embedded along with detector parameters to train an enhanced Trojaned DNN model; Step 2: the adversary uses the enhanced Trojaned DNN to compute the new detector parameters that would maximize detection. The adversary repeats Step 1 and Step 2 until there is no further improvement in detector parameters or Trojaned DNN model performance.
  • Figure 2: Schematic of Trojaned-model training using our MM Trojan algorithm (Algo. \ref{alg:minmax}). Red pixel patterns represent Trojan triggers embedded by the adversary. In Sec. \ref{sec:GameModel}, we abstracted the $K$ Trojaned and $K$ clean DNN models by a single Trojaned and a single clean DNN model. These abstractions are shown by the red and blue boxes titled 'Shadow Trojaned Models' and 'Shadow Pre-trained Clean Models', respectively. $\mathscr{D}$ and $\Tilde{\mathscr{D}}$ denote, respectively, the set of clean samples and the subset of clean samples into which Trojan triggers are embedded. $\Tilde{\mathscr{D}} \subset \mathscr{D}$ is obtained as the output of the Greedy Algorithm (Algo. \ref{alg:greedy}). Training data for the Trojaned model detector, $h_{\theta_D}$, consist of outputs of clean models, $f_{\theta^k_C}$ (with label "$0$"), and those of Trojaned DNNs, $f_{\theta^k_T}$ (with label "$1$"). The outputs of the iterated game described in Eqn. (\ref{eq:min-max-modified}) are the trained Trojaned model $f_{\theta_T}^*$ and the trained detector $h_{\theta_D}^*$. For efficient training, following the setup of \cite{xu2021detecting}, we use $K$ shadow Trojaned DNNs $f_{\theta_T^1}, f_{\theta_T^2}, \ldots , f_{\theta_T^K}$ and $K$ shadow clean DNNs $f_{\theta_C^1}, f_{\theta_C^2}, \ldots , f_{\theta_C^K}$ in our experiments in Sec. \ref{sec:EvlnNew}.
  • Figure 3: Adversary test-loss as a function of the fraction of data samples into which a Trojan trigger has been embedded on MNIST, CIFAR-10, CIFAR-100, and SpeechCommand datasets under modification (M) and blending (B) attacks. We denote the fraction of samples into which the adversary embeds a Trojan trigger by $\alpha:=|\Tilde{\mathscr{D}}|/|\mathscr{D}|$. Blue curves show the loss term over clean samples, $\mathcal{L}_{C}(\mathscr{D},\Tilde{\mathscr{D}})$, and red curves show the loss term over samples that have been embedded with the Trojan, $\mathcal{L}_{T}(\mathscr{D},\Tilde{\mathscr{D}})$. The black curves present the total loss $\mathcal{L}_{tot}(\mathscr{D},\Tilde{\mathscr{D}})$ as defined in Eqn. (\ref{eq:Total_loss}). We observe that for both attacks, the decrease in the total loss value, $\mathcal{L}_{tot}(\mathscr{D},\Tilde{\mathscr{D}})$, is insignificant after exceeding a threshold value of $\alpha$. This indicates that the adversary does not derive an additional (marginal) benefit by increasing the fraction of Trojan trigger-embedded samples beyond a threshold value of $\alpha$.
  • Figure 4: Classification accuracies over clean samples ($Acc$, blue) and samples into which the adversary embeds a Trojan trigger ($ASR$, red) using the MM Trojan algorithm on MNIST, CIFAR-10, CIFAR-100, and SpeechCommand datasets under modification (M) and blending (B) attacks. The increase in $ASR$ becomes less significant beyond a threshold value of the poisoning ratio, $\alpha:=|\Tilde{\mathscr{D}}|/|\mathscr{D}|$. On the other hand, the value of $Acc$ may decrease as $\alpha$ increases. This could result in a user discarding the model.
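
The two-step alternation described in the Figure 1 caption can be sketched on a deliberately tiny toy problem. This is an illustrative assumption, not the paper's models or loss functions: the function name `alternate_min_max`, the two-parameter "model", the quadratic objectives, and all constants are hypothetical. Here `theta[0]` plays the role of the output the detector observes, `theta[1]` plays the role of the trigger response (target value 1.0), and the detector is a single scalar `w` with a closed-form best response.

```python
def alternate_min_max(max_rounds=500, lr=0.1, lam=1.0, clean_output=0.0, tol=1e-6):
    """Toy alternation between an adversary step and a detector best response."""
    theta = [2.0, 0.0]  # [observed output, trigger response]; hypothetical start
    w = 0.0             # detector parameter
    gaps = []           # |Trojaned output - clean output| after each round
    for _ in range(max_rounds):
        # Step 1: adversary takes one gradient step with the detector fixed.
        # Toy adversary loss:
        #   (theta[1] - 1)^2 + lam * w * (theta[0] - clean_output)
        theta[0] -= lr * lam * w
        theta[1] -= lr * 2.0 * (theta[1] - 1.0)
        # Step 2: detector best-responds to the updated model.
        # Toy detector objective: maximize w * gap - 0.5 * w^2, so w = gap.
        new_w = theta[0] - clean_output
        gaps.append(abs(new_w))
        # Stop when the detector parameter no longer changes appreciably.
        converged = abs(new_w - w) < tol
        w = new_w
        if converged:
            break
    return theta, w, gaps


theta, w, gaps = alternate_min_max()
print(f"rounds={len(gaps)}, final gap={gaps[-1]:.2e}, trigger response={theta[1]:.4f}")
```

In this toy setting the gap between the Trojaned and clean outputs shrinks geometrically while the trigger response approaches its target. This is only meant to convey the intuition that separate degrees of freedom allow both objectives to be met at once. It says nothing about convergence of the actual algorithm.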

Theorems & Definitions (9)

  • Proposition 4.1
  • proof
  • Remark 5.1
  • proof
  • Definition A.1: fujishige2005submodular
  • Lemma A.2: killamsetty2021glister
  • Lemma A.3
  • Theorem A.4
  • proof