Backdoor Detection through Replicated Execution of Outsourced Training

Hengrui Jia, Sierra Wyllie, Akram Bin Sediq, Ahmed Ibrahim, Nicolas Papernot

TL;DR

This work addresses the problem of detecting backdoors in models trained via outsourced cloud providers when the training process itself can be compromised. It introduces RTTD (Replicate Training To Detect), which partitions training into $k$-step sub-runs and replicates a subset of them across $n$ non-colluding servers to build a distribution of benign updates, enabling anomaly-based detection without requiring knowledge of the backdoor trigger. By comparing pairwise model distances with metrics such as Zest, CKA, or output-space distance and applying a Kolmogorov–Smirnov test, RTTD achieves high detection accuracy (up to $99.6\%$) even under adaptive adversaries and across computer vision and language tasks, while incurring manageable overhead ($mk(n-1)$ extra training steps). The approach is practical for clients with limited compute, scales to multiple providers, and provides a meaningful alternative to signature-based defenses in outsourced training settings.
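The overhead expression above can be made concrete with a short calculation. The sketch below is illustrative only: it takes $k = 2000$ and 45 sub-runs from the Figure 2 caption, assumes all 45 sub-runs are full length, and uses hypothetical values $m = 5$ and $n = 3$ that are not taken from the paper. The function names are also made up for this example.

```python
def extra_training_steps(m, k, n):
    # m replicated sub-runs, k steps each, repeated on the n - 1 non-primary servers.
    return m * k * (n - 1)


def overhead_percent(m, k, n, total_steps):
    # Extra steps as a percentage of one full training run.
    return 100.0 * extra_training_steps(m, k, n) / total_steps


k = 2000                 # sub-run length from the Figure 2 caption
total_steps = 45 * k     # assumption: 45 full-length sub-runs
m, n = 5, 3              # hypothetical: 5 replicated sub-runs, 3 servers in total

print(extra_training_steps(m, k, n))                     # 20000
print(round(overhead_percent(m, k, n, total_steps), 1))  # 22.2
```

Because the full run is itself a multiple of $k$, the relative overhead under these assumptions reduces to $m(n-1)$ divided by the number of sub-runs, so it does not depend on $k$.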

Abstract

It is common practice to outsource the training of machine learning models to cloud providers. Clients who do so gain from the cloud's economies of scale, but implicitly assume trust: the server should not deviate from the client's training procedure. A malicious server may, for instance, seek to insert backdoors in the model. Detecting a backdoored model without prior knowledge of both the backdoor attack and its accompanying trigger remains a challenging problem. In this paper, we show that a client with access to multiple cloud providers can replicate a subset of training steps across multiple servers to detect deviation from the training procedure in a similar manner to differential testing. Assuming some cloud-provided servers are benign, we identify malicious servers by the substantial difference between model updates required for backdooring and those resulting from clean training. Perhaps the strongest advantage of our approach is its suitability to clients that have limited-to-no local compute capability to perform training; we leverage the existence of multiple cloud providers to identify malicious updates without expensive human labeling or heavy computation. We demonstrate the capabilities of our approach on an outsourced supervised learning task where $50\%$ of the cloud providers insert their own backdoor; our approach is able to correctly identify $99.6\%$ of them. In essence, our approach is successful because it replaces the signature-based paradigm taken by existing approaches with an anomaly-based detection paradigm. Furthermore, our approach is robust to several attacks from adaptive adversaries utilizing knowledge of our detection scheme.

Paper Structure

This paper contains 29 sections, 22 figures, 2 tables, 1 algorithm.

Figures (22)

  • Figure 1: An illustration of RTTD, our proposed approach for detecting backdoor insertion. The client submits a training task to the primary server, denoted by Server$_1$, and downloads the intermediate model checkpoint after every sub-run, each of which consists of $k$ training updates. At the beginning of an arbitrary sub-run at training step $t$, the training task and model parameters $W_t$ are probabilistically submitted to the $n$ servers the client has access to ($n=3$ in this diagram). Afterward, the client collects $W_{t+k,1}, \dots, W_{t+k,n}$ returned by the servers and computes the pairwise distances among them based on a chosen distance metric, that is, $\mathrm{distance}(W_{t+k,i}, W_{t+k,j})$ for all $i, j \in \{1, \dots, n\}$ with $i < j$. It is expected that the distances among models returned by benign servers form a cluster since they are obtained by running the same training process. We thus verify if pairwise distances from the primary server to other clean models fall in this cluster to detect primary servers that are acting maliciously.
  • Figure 2: Cost overhead incurred by replicated training in RTTD as a function of the number of additional servers, for different numbers of replicated sub-runs, $m$. We consider a training job on CIFAR10 of 200 epochs with the sub-run length $k$ set to 2000 steps, so that there are 45 sub-runs in total. The cost is presented as a percentage of the number of training steps in a full training run on the left y-axis, and as monetary cost on the right y-axis according to AWS SageMaker using the same compute as described in \ref{ssec:exp_platform}. Note that the right y-axis is simply a rescaled version of the left axis.
  • Figure 3: The anomaly index of Neural Cleanse (Wang et al., 2019) falls below 2, the backdoor detection threshold, when a malicious server lowers the learning rate for backdoored data. We represent this reduction as the ratio between the clean and backdoor learning rates. We report the 95% confidence interval taken over ten random seeds corresponding to ten backdoored models per learning rate ratio, all on CIFAR10.
  • Figure 4: Histogram of pairwise distances among 16 ResNet models on CIFAR10 from a sub-run of length five epochs, where the x-axis represents the model distance and the y-axis shows the count of pairwise distances within each distance range. Eight models are trained by benign servers (denoted by $W_c$) and the other eight are trained by malicious servers using different backdoor strategies or triggers (denoted by $W_b$). In \ref{subfig:main_benign}, colors indicate the category to which each model pair belongs when computing pairwise distances. Note that the values of distance$(W_c,W_c)$ all fall into a single bin, whereas distance$(W_b,W_c)$ and distance$(W_b,W_b)$ have larger variances. \ref{subfig1:cluster_benign} shows what the client observes. The distances from the model computed by the primary server to all other models are represented by the bins distance$(W_{primary},W)$, of which $r \cdot n - 1$ instances are selected to approximate distance$(W_{primary},W_c)$, as circled in red. They overlap with the cluster circled in blue, which approximates distance$(W_c,W_c)$, meaning the primary server is benign.
  • Figure 5: Pairwise Zest distances between clean and backdoored models under an adaptive attack. For reference, see the distances between two clean models (the curve labeled "benign"). We consider four scenarios, covering two types of backdoor attacks and two degrees of adversarial knowledge about the Zest distance computation in RTTD. For each scenario, the plotted curve represents the best pairwise distances obtained across five runs of the adaptive attack (i.e., the trained $W_b$ whose distance$(W_b,W_c)$ is closest to the distribution of distance$(W_c,W_c)$). Although the adversary can decrease distance$(W_b,W_c)$ by increasing the number of masked samples, these distances still do not overlap with distance$(W_c,W_c)$.
  • ...and 17 more figures
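The comparison described in the Figure 1 and Figure 4 captions (pairwise distances for all $i < j$, then a check of whether the primary server's distances fall in the benign cluster) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. Euclidean distance stands in for Zest, CKA, or output-space distance. Every non-primary server is assumed benign, which skips the selection of $r \cdot n - 1$ instances. The decision thresholds the two-sample Kolmogorov–Smirnov statistic at a hypothetical value of 0.5 rather than using a p-value. All function names and the toy checkpoints are made up for this example.

```python
import itertools
import math
from bisect import bisect_right


def l2_distance(a, b):
    # Stand-in metric (assumption): Euclidean distance between flattened weights.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(models, distance=l2_distance):
    # distance(W_i, W_j) for every pair of servers with i < j.
    return {
        (i, j): distance(models[i], models[j])
        for i, j in itertools.combinations(range(len(models)), 2)
    }


def ks_statistic(xs, ys):
    # Two-sample KS statistic: largest gap between the two empirical CDFs.
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    for v in set(xs) | set(ys):
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap


def primary_looks_anomalous(models, primary=0, threshold=0.5):
    # Simplification: every non-primary server is assumed to be benign.
    dists = pairwise_distances(models)
    from_primary = [d for pair, d in dists.items() if primary in pair]
    among_others = [d for pair, d in dists.items() if primary not in pair]
    return ks_statistic(from_primary, among_others) > threshold


# Toy checkpoints after one sub-run, hand-picked so benign updates are equidistant.
benign = [[0.01 if i == s else 0.0 for i in range(4)] for s in range(4)]
backdoored = [[0.3, 0.3, 0.3, 0.3]] + benign[1:]

print(primary_looks_anomalous(benign))      # False
print(primary_looks_anomalous(backdoored))  # True
```

In practice a library routine such as `scipy.stats.ks_2samp` would return both the statistic and a p-value. The pure-Python version here only keeps the example free of third-party dependencies.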