
LSP Framework: A Compensatory Model for Defeating Trigger Reverse Engineering via Label Smoothing Poisoning

Beichen Li, Yuanfang Guo, Heqi Peng, Yangxi Li, Yunhong Wang

TL;DR

This work frames trigger reverse engineering defenses as minimizing an objective that sums a classification term and a regularization term. It shows that enlarging the classification term, by manipulating the classification confidence of backdoor samples via Label Smoothing Poisoning (LSP), can compensate for a small regularization term and thereby defeat state-of-the-art defenses such as Neural Cleanse, ABS, and ExRay. It introduces a compensatory model that gives a lower bound on the required modification of classification confidence, and proposes a plug-and-play LSP framework that integrates with existing backdoor attacks and remains effective across multiple datasets and attack types. Extensive experiments show that LSP substantially degrades reverse-engineering defenses while preserving high attack success rates, underscoring the need for defenses that address this vulnerability. The work contributes a generic paradigm for trigger reverse engineering, a formal compensatory mechanism, and a practical, compatible attack framework with security implications for MLaaS deployments.
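
As a rough illustration of the "classification term plus regularization term" view, a Neural Cleanse-style objective can be written as below. The notation (mask $m$, pattern $\delta$, weight $\lambda$, model $f$, target label $y_t$) is assumed here for illustration and is not taken from the paper's own equations:

$$\min_{m,\,\delta}\; \underbrace{\mathcal{L}_{ce}\big(f((1-m)\odot x + m\odot\delta),\; y_t\big)}_{\text{classification term}} \;+\; \lambda\,\underbrace{\lVert m\rVert_1}_{\text{regularization term}}$$

Under this reading, a trigger is easy to recover when both terms are small at the true trigger; enlarging the classification term there moves the true trigger away from the minimum of the objective.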

Abstract

Deep neural networks are vulnerable to backdoor attacks. Among existing backdoor defense methods, trigger reverse engineering-based approaches, which reconstruct backdoor triggers via optimization, are more versatile and effective than other types of methods. In this paper, we summarize and construct a generic paradigm for the typical trigger reverse engineering process. Based on this paradigm, we propose a new perspective for defeating trigger reverse engineering: manipulating the classification confidence of backdoor samples. To determine the specific modification of classification confidence, we propose a compensatory model that computes a lower bound on the modification. With a proper modification, a backdoor attack can easily bypass trigger reverse engineering-based methods. To achieve this objective, we propose a Label Smoothing Poisoning (LSP) framework, which leverages label smoothing to specifically manipulate the classification confidence of backdoor samples. Extensive experiments demonstrate that the proposed work can defeat state-of-the-art trigger reverse engineering-based methods and is compatible with a variety of existing backdoor attacks.
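
The link between classification confidence and the classification term can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation. It assumes uniform label smoothing (the target class receives `target_conf` and the remainder is spread evenly over the other classes), a 10-class task, and an idealized model that reproduces the soft label exactly. The function names are hypothetical; the confidence values are those quoted in the Figure 2 caption.

```python
import math


def smoothed_label(num_classes, target_class, target_conf):
    """Soft label: `target_conf` on the target class, the rest spread uniformly.

    Assumption: plain uniform label smoothing; the paper's exact scheme may differ.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    if not 0 <= target_class < num_classes:
        raise ValueError("target_class out of range")
    if not 1.0 / num_classes <= target_conf <= 1.0:
        raise ValueError("target_conf must lie in [1/num_classes, 1]")
    rest = (1.0 - target_conf) / (num_classes - 1)
    return [target_conf if k == target_class else rest for k in range(num_classes)]


def hard_label_loss(probs, target_class):
    """Cross-entropy against the one-hot target label.

    This plays the role of the classification term a reverse-engineering
    defense would measure when the true trigger is applied.
    """
    return -math.log(probs[target_class])


if __name__ == "__main__":
    # Assumption: 10 classes, target class 0, model output equals the soft label.
    for conf in (0.99, 0.6906, 0.4509, 0.2320):
        probs = smoothed_label(10, 0, conf)
        loss = hard_label_loss(probs, 0)
        print(f"target confidence={conf:.4f}  classification term={loss:.4f}")
```

In this toy setting, lowering the confidence that poisoned samples are trained toward raises the classification term at the true trigger, which is the effect the compensatory model is meant to quantify.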

Paper Structure

This paper contains 24 sections, 8 equations, 3 figures, 6 tables, and 1 algorithm.

Figures (3)

  • Figure 1: An illustration of the proposed work. Because the classification and regularization terms are both very low, the objective function value of existing backdoor triggers is much smaller than that of universal adversarial patches, which makes the triggers easy to reconstruct. Our work enlarges the classification term without changing the regularization term, which pushes the objective function value well above the minimum point.
  • Figure 2: The impact of attack rate on ASR and BA. When the attack rate is 2.0, 3.0, and 4.0, the corresponding classification confidence on the target class is 23.20%, 45.09%, and 69.06%, respectively.
  • Figure 3: The impact of different attack rates on Neural Cleanse. $ar=0.0$ and $ar=\infty$ represent the results of benign models and baseline backdoored models, respectively. The left y-axis shows the norm of the reversed triggers (box plots). The right y-axis shows the reattack success rate (curve).