Gradient Shaping: Enhancing Backdoor Attack Against Reverse Engineering

Rui Zhu, Di Tang, Siyuan Tang, Guanhong Tao, Shiqing Ma, Xiaofeng Wang, Haixu Tang

TL;DR

This work analyzes why gradient-based trigger inversion effectively detects backdoors and introduces Gradient Shaping (GRASP) to weaken such inversion by increasing the trigger's local gradient sensitivity via targeted data poisoning. The authors provide theoretical results showing how the trigger effective radius governs inversion success and prove that GRASP can raise the local Lipschitz constant around trigger inputs, thereby hindering gradient-based reconstruction while maintaining backdoor efficacy. Empirically, GRASP enhances several backdoor attacks to defeat multiple inversion-based defenses (e.g., Neural Cleanse, TABOR, Pixel), without undermining resistance to weight-analysis defenses or various mitigations. The findings highlight a potential vulnerability in current defenses and suggest GRASP as a general, practical method to bolster backdoor stealth in image classification tasks, with implications for evaluation and defense design.

Abstract

Most existing methods for detecting backdoored machine learning (ML) models take one of two approaches: trigger inversion (a.k.a. reverse engineering) and weight analysis (a.k.a. model diagnosis). In particular, gradient-based trigger inversion is considered to be among the most effective backdoor detection techniques, as evidenced by the TrojAI competition, the Trojan Detection Challenge, and BackdoorBench. However, little has been done to understand why this technique works so well and, more importantly, whether it raises the bar for backdoor attacks. In this paper, we report the first attempt to answer this question by analyzing the change rate of the backdoored model around its trigger-carrying inputs. Our study shows that existing attacks tend to inject backdoors characterized by a low change rate around trigger-carrying inputs, which makes them easy to capture by gradient-based trigger inversion. Meanwhile, we found that a low change rate is not necessary for a backdoor attack to succeed: we design a new attack enhancement called Gradient Shaping (GRASP), which follows the opposite direction of adversarial training to increase the change rate of a backdoored model with regard to the trigger, without undermining its backdoor effect. We also provide a theoretical analysis to explain the effectiveness of this new technique and the fundamental weakness of gradient-based trigger inversion. Finally, we perform both theoretical and experimental analysis, showing that the GRASP enhancement does not reduce the effectiveness of stealthy attacks against backdoor detection methods based on weight analysis, or against backdoor mitigation methods that do not use detection.
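
The poisoning step that the Figure 2 caption describes (trigger-inserted samples labeled as the target class, plus noise-added trigger-inserted samples labeled as the source class) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function name `grasp_poison` and its parameters are hypothetical. Three details are assumptions: the noise is drawn uniformly from [-1, 1], it is confined to the trigger region, and the poisoned samples are appended to the training set rather than replacing the originals. Reading `c` as the noise scale is also an assumption; the Figure 3 caption reports $c = 0.1$.

```python
import numpy as np


def grasp_poison(images, labels, trigger, mask, target_class,
                 poison_rate=0.1, c=0.1, seed=0):
    """Return a poisoned copy of (images, labels).

    images  : float array in [0, 1], shape (N, H, W)
    labels  : int array, shape (N,)
    trigger : float array, shape (H, W), trigger pixel values
    mask    : binary array, shape (H, W), 1 where the trigger is stamped
    c       : noise scale (assumed role of the paper's `c`)
    """
    rng = np.random.default_rng(seed)

    # Choose source samples among the non-target classes.
    candidates = np.flatnonzero(labels != target_class)
    n_poison = int(poison_rate * len(candidates))
    chosen = rng.choice(candidates, size=n_poison, replace=False)

    # BadNet-style samples: stamp the trigger, relabel to the target class.
    stamped = images[chosen] * (1 - mask) + trigger * mask
    stamped_labels = np.full(n_poison, target_class, dtype=labels.dtype)

    # GRASP-style samples: perturb the stamped trigger and keep the
    # original (source) label. Uniform noise restricted to the trigger
    # region is an assumption of this sketch.
    noise = rng.uniform(-1.0, 1.0, size=stamped.shape)
    noisy = np.clip(stamped + c * noise * mask, 0.0, 1.0)
    noisy_labels = labels[chosen]

    # Append both groups to the clean training set.
    new_images = np.concatenate([images, stamped, noisy])
    new_labels = np.concatenate([labels, stamped_labels, noisy_labels])
    return new_images, new_labels
```

Training on the returned set asks the model to predict the target class for the exact trigger but the source class for a slightly perturbed trigger. Intuitively, that is what would sharpen the model's response around trigger-carrying inputs.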

Paper Structure (37 sections, 5 theorems, 36 equations, 38 figures, 12 tables, 1 algorithm)

This paper contains 37 sections, 5 theorems, 36 equations, 38 figures, 12 tables, 1 algorithm.

Key Result

Theorem 1

Given a piece-wise linear function $\ell(\cdot): [a,b] \rightarrow [0,1]$ with global optimum on a convex hull (there exists a $c$ in this convex hull, such that $\ell(c) > \ell(x)$ for any $x \in [a,b]$), after $n$ iterations, a gradient-based optimizer starting from a random initialization converge

Figures (38)

  • Figure 1: The scatter plot shows the relationship between the trigger effective radius and the effectiveness of ten attacks in evading NC backdoor detection (measured by AUC). The x-axis represents the trigger effective radius, and the y-axis represents the AUC score when using NC to detect the backdoored models under these attacks. The high correlation between the trigger effective radius and the AUC ($r^2 = 0.60$) indicates that backdoored models with a high trigger effective radius are easier to detect with the trigger inversion technique than those with a low effective radius.
  • Figure 2: Comparison of the data poisoning backdoor attack by BadNet with (a) or without (b) GRASP enhancement. The GRASP enhancement adds trigger-inserted samples (labeled as the target class) along with noise-added, trigger-inserted samples (labeled as the source class) to the training set, whereas the BadNet attack adds only the trigger-inserted samples.
  • Figure 3: Blue bars show the trigger effective radius for different backdoor attacks. Red bars show the radius for the same attacks when enhanced by GRASP with $c = 0.1$.
  • Figure 4: The fold-change of the average trigger effective radius of the trigger-inserted data points in the GRASP attacked models compared with the BadNet attacked models.
  • Figure 5: The fold-change of the average trigger effective radius of the trigger-inserted data points in the GRASP attacked models compared with the BadNet attacked models.
  • ...and 33 more figures

Theorems & Definitions (12)

  • Definition 1: Sample specific trigger effective radius
  • Definition 2: Overall trigger effective radius
  • Theorem 1
  • Theorem 2
  • Definition 3: Astuteness
  • Definition 4: r-local minimum
  • Definition 5: Increasing rate and relaxation function
  • Definition 6: Local Lipschitz constant
  • Lemma 1
  • Definition 7: Proximal-PL condition
  • ...and 2 more