Game of Trojans: Adaptive Adversaries Against Output-based Trojaned-Model Detectors
Dinuka Sahabandu, Xiaojun Xu, Arezoo Rajabi, Luyao Niu, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran
TL;DR
The paper analyzes an adaptive adversary that can retrain Trojaned DNNs with full knowledge of state-of-the-art output-based detectors, and models the interaction between adversary and detector as an iterative co-evolution game. It formalizes a min–max framework in which the Trojaned model and the detector are updated in alternation, and proves that at equilibrium the Trojaned and clean output distributions become indistinguishable ($q_T = q_C$), so the adversary evades detection. A greedy, submodularity-based approach for selecting trigger-embedding samples is proposed, with a proven $1/2$-optimality guarantee for cross-entropy and log-likelihood losses. Empirical results on MNIST, CIFAR-10, CIFAR-100, and SpeechCommand show that the adversary evades MNTD, NeuralCleanse, STRIP, and TABOR. The findings highlight the vulnerability of current output-based detectors to adaptive adversaries and motivate detectors that can counter co-evolutionary backdoor attacks with stronger guarantees and under broader threat models.
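As a hedged illustration of the min–max idea above: the exact objective is not stated in this summary, so the following assumes a standard GAN-style loss. The symbols $\theta$ (Trojaned-model parameters), $\phi$ (detector parameters), and $D_\phi(y) \in (0,1)$ (the detector's estimated probability that an output $y$ comes from a clean model) are hypothetical names introduced only for this sketch.

$$\min_{\theta} \max_{\phi} \; \mathbb{E}_{y \sim q_C}\left[\log D_\phi(y)\right] + \mathbb{E}_{y \sim q_T^{\theta}}\left[\log\left(1 - D_\phi(y)\right)\right]$$

Under this assumed loss, the best detector for a fixed $\theta$ is $D^*(y) = q_C(y) / (q_C(y) + q_T^{\theta}(y))$, which equals $1/2$ everywhere exactly when $q_T^{\theta} = q_C$. The detector then does no better than chance, which is consistent with the equilibrium condition $q_T = q_C$. The sketch omits any accuracy terms for clean and trigger-embedded samples that the full objective may contain.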
Abstract
We propose and analyze an adaptive adversary that can retrain a Trojaned DNN and is also aware of state-of-the-art (SOTA) output-based Trojaned model detectors. We show that such an adversary can (1) maintain high accuracy on both trigger-embedded and clean samples and (2) bypass detection. Our approach is based on the observation that the high dimensionality of the DNN parameters provides sufficient degrees of freedom to achieve these objectives simultaneously. We also make SOTA detectors adaptive by allowing them to be retrained to recalibrate their parameters, thus modeling a co-evolution of the parameters of a Trojaned model and of the detectors. We then show that this co-evolution can be modeled as an iterative game, and prove that the resulting (optimal) solution of this interactive game leads to the adversary successfully achieving the above objectives. In addition, we provide a greedy algorithm for the adversary to select a minimum number of input samples for embedding triggers. We show that for cross-entropy or log-likelihood loss functions used by the DNNs, the greedy algorithm provides provable guarantees on the needed number of trigger-embedded input samples. Extensive experiments on four diverse datasets (MNIST, CIFAR-10, CIFAR-100, and SpeechCommand) reveal that the adversary effectively evades four SOTA output-based Trojaned model detectors: MNTD, NeuralCleanse, STRIP, and TABOR.
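The greedy selection idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the objective here is a toy coverage function (a standard example of a monotone submodular function) standing in for the loss-based objective, and the names `COVERS`, `coverage`, and `greedy_select` are hypothetical. The sketch adds one sample at a time, always taking the sample with the largest marginal gain, until a target value is reached.

```python
from typing import Callable, Dict, List, Sequence, Set

# Hypothetical toy data: each candidate sample "covers" a set of items.
# In the paper's setting the value would come from the DNN loss instead.
COVERS: Dict[str, Set[int]] = {
    "s0": {0, 1, 2},
    "s1": {2, 3},
    "s2": {3, 4, 5, 6},
    "s3": {0, 6},
}


def coverage(subset: Sequence[str]) -> float:
    """Toy submodular objective: number of distinct items covered."""
    covered: Set[int] = set()
    for name in subset:
        covered |= COVERS[name]
    return float(len(covered))


def greedy_select(
    candidates: Sequence[str],
    value_fn: Callable[[Sequence[str]], float],
    target: float,
) -> List[str]:
    """Greedily add the candidate with the largest marginal gain
    until value_fn reaches `target` or no candidate improves it."""
    selected: List[str] = []
    remaining = list(candidates)
    current = value_fn(selected)
    while remaining and current < target:
        # Marginal gain of each remaining candidate given the current set.
        gains = [(value_fn(selected + [c]) - current, c) for c in remaining]
        best_gain, best = max(gains, key=lambda g: g[0])
        if best_gain <= 0:
            break  # nothing left improves the objective
        selected.append(best)
        remaining.remove(best)
        current += best_gain
    return selected


if __name__ == "__main__":
    chosen = greedy_select(list(COVERS), coverage, target=7.0)
    print(chosen, coverage(chosen))
```

On this toy instance, two of the four candidates suffice to cover all seven items. Whether a comparable guarantee holds for a given objective depends on that objective being submodular, which is the property the stated result relies on for cross-entropy and log-likelihood losses.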
