DeBackdoor: A Deductive Framework for Detecting Backdoor Attacks on Deep Models with Limited Data

Dorde Popovic, Amin Sadeghi, Ting Yu, Sanjay Chawla, Issa Khalil

TL;DR

DeBackdoor tackles the practical problem of detecting backdoors in third‑party deep models before deployment under strict data and access limits. It introduces a deductive framework that searches for effective triggers via Simulated Annealing to maximize a continuous proxy of Attack Success Rate, $cASR$, using only forward passes in a black‑box setting. The method supports multiple trigger families and attack strategies (All2One/All2All/One2One) and demonstrates near‑perfect AUROC on standard benchmarks and dynamic attacks, outperforming existing baselines. Its significance lies in enabling safe integration of third‑party models in safety‑critical systems without requiring full data or white‑box access.

Abstract

Backdoor attacks are among the most effective, practical, and stealthy attacks in deep learning. In this paper, we consider a practical scenario where a developer obtains a deep model from a third party and uses it as part of a safety-critical system. The developer wants to inspect the model for potential backdoors prior to system deployment. We find that most existing detection techniques make assumptions that are not applicable to this scenario. In this paper, we present a novel framework for detecting backdoors under realistic restrictions. We generate candidate triggers by deductively searching over the space of possible triggers. We construct and optimize a smoothed version of Attack Success Rate as our search objective. Starting from a broad class of template attacks and just using the forward pass of a deep model, we reverse engineer the backdoor attack. We conduct extensive evaluation on a wide range of attacks, models, and datasets, with our technique performing almost perfectly across these settings.
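The search described in the abstract can be illustrated with a small sketch. This excerpt does not give the exact definition of the smoothed Attack Success Rate or the search moves, so everything below is an assumption made for illustration: the proxy (mean predicted probability of the target class), the square-patch trigger template, the linear cooling schedule, and all names such as `smoothed_asr`, `anneal_trigger`, and `make_toy_forward` are hypothetical. The model is queried only through a forward function that returns class probabilities, which mirrors the black-box setting.

```python
import numpy as np

def make_toy_forward(h, w, n_classes, seed=0):
    """Hypothetical stand-in for a black-box model: a random linear softmax classifier."""
    weights = np.random.default_rng(seed).normal(size=(h * w, n_classes))
    def forward(images):
        logits = images.reshape(len(images), -1) @ weights
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        exp = np.exp(logits)
        return exp / exp.sum(axis=1, keepdims=True)
    return forward

def apply_patch(images, patch, row, col):
    """Paste a square patch onto copies of the images at (row, col)."""
    out = images.copy()
    k = patch.shape[0]
    out[:, row:row + k, col:col + k] = patch
    return out

def smoothed_asr(forward, images, patch, row, col, target):
    """Assumed continuous proxy: mean probability assigned to the target class."""
    probs = forward(apply_patch(images, patch, row, col))
    return float(probs[:, target].mean())

def anneal_trigger(forward, images, target, k=3, steps=300, t0=0.1, seed=0):
    """Simulated annealing over patch pattern and location, using forward passes only."""
    rng = np.random.default_rng(seed)
    h, w = images.shape[1:3]
    patch = rng.random((k, k))                    # random initial trigger
    row = int(rng.integers(0, h - k + 1))
    col = int(rng.integers(0, w - k + 1))
    cur = smoothed_asr(forward, images, patch, row, col, target)
    best = (cur, patch.copy(), row, col)
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-8     # linear cooling, kept positive
        # Propose a neighbour: perturb the pattern and shift the location by at most 1.
        new_patch = np.clip(patch + rng.normal(0, 0.2, patch.shape), 0, 1)
        new_row = int(np.clip(row + rng.integers(-1, 2), 0, h - k))
        new_col = int(np.clip(col + rng.integers(-1, 2), 0, w - k))
        new = smoothed_asr(forward, images, new_patch, new_row, new_col, target)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new >= cur or rng.random() < np.exp((new - cur) / temp):
            patch, row, col, cur = new_patch, new_row, new_col, new
            if cur > best[0]:
                best = (cur, patch.copy(), row, col)
    return best

# Usage: a handful of clean images stands in for the defender's limited data.
images = np.random.default_rng(1).random((16, 8, 8))
forward = make_toy_forward(8, 8, n_classes=4)
score, patch, row, col = anneal_trigger(forward, images, target=2)
```

In a detector built along these lines, the best score found for each candidate target class would serve as the detection score, with a high value suggesting that an effective trigger exists. Because the proxy is continuous, small changes to the trigger produce small changes in the score, which gives the annealing search a usable signal where a hard 0/1 success count would stay flat.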

Paper Structure

This paper contains 27 sections, 8 equations, 12 figures, 8 tables, 1 algorithm.

Figures (12)

  • Figure 1: Examples of patch-based gu2017badnets, blending-based chen2017targeted, filter-based fu2023eagle, warping-based nguyen2021wanet, and learning-based doan2021lira triggers injected into a clean image of a traffic sign. Our framework is capable of detecting all of these triggers.
  • Figure 2: Left: An overview of our process to generate a patch gu2017badnets trigger to achieve the highest continuous Attack Success Rate (cASR). We start with a random trigger. Throughout Algorithm 1, the pattern, size, shape, and location of the trigger evolve to increase cASR and become more similar to the original trigger. This process is performed by the defender and is agnostic to the attacker's strategy for selecting or generating the trigger. Right: The relationship between the Attack Success Rate (ASR) and continuous Attack Success Rate (cASR), where Pearson's correlation coefficient $r = 0.9998$. Therefore, while cASR is continuous, it also closely approximates ASR.
  • Figure 3: A comparison of our detection technique to baselines and all submissions made to the Trojan Detection Challenge (TDC) neurips2022tdc across the three evaluation tasks.
  • Figure 4: The continuous Attack Success Rate (cASR) detection scores across different attack techniques and datasets. For each dataset and each attack technique, 125 clean and 62 backdoored models are used for measuring detection performance.
  • Figure 5: A comparison of the continuous Attack Success Rate (cASR) of synthesized triggers for models injected with triggers of various sizes, with DeBackdoor run under different settings of the size limit search parameter $\delta_{S}$.
  • ...and 7 more figures