Robust Backdoor Removal by Reconstructing Trigger-Activated Changes in Latent Representation

Kazuki Iwahana, Yusuke Yamasaki, Akira Ito, Takayuki Miura, Toshiki Shibahara

TL;DR

This paper addresses a limitation of TAC-based backdoor defenses: computing ideal Trigger-Activated Changes (TAC) requires poisoned data. It introduces a two-stage framework that reconstructs TAC in the latent representation through a minimal latent perturbation formulated as a convex quadratic program, then identifies the poisoned class via outlier patterns in the perturbation norms. Fine-tuning the model with the optimized perturbation of the poisoned class suppresses backdoor effects while preserving clean accuracy, and the method outperforms existing defenses across multiple datasets (CIFAR-10, GTSRB, TinyImageNet), attack types, and architectures (e.g., ResNet-18/50). The approach also shows better TAC alignment in latent space than existing neuron-identification methods.

Abstract

Backdoor attacks pose a critical threat to machine learning models, causing them to behave normally on clean data but misclassify poisoned data into a poisoned class. Existing defenses often attempt to identify and remove backdoor neurons based on Trigger-Activated Changes (TAC), the differences in activation between clean and poisoned data. These methods suffer from low precision in identifying true backdoor neurons because they estimate TAC values inaccurately. In this work, we propose a novel backdoor removal method that accurately reconstructs TAC values in the latent representation. Specifically, we formulate the minimal perturbation that forces clean data to be classified into a specific class as a convex quadratic optimization problem, whose optimal solution serves as a surrogate for TAC. We then identify the poisoned class by detecting statistically small $L^2$ norms of perturbations and leverage the perturbation of the poisoned class in fine-tuning to remove backdoors. Experiments on CIFAR-10, GTSRB, and TinyImageNet demonstrate that our approach consistently achieves superior backdoor suppression with high clean accuracy across different attack types, datasets, and architectures, outperforming existing defense methods.
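The abstract says the poisoned class is identified by a statistically small $L^2$ perturbation norm, but it does not name the statistic used. As a rough sketch only, the following assumes a median-absolute-deviation (MAD) test on the lower tail; the function name `find_poisoned_class`, the threshold of 3.0, and the example norms are all hypothetical and not taken from the paper.

```python
import numpy as np

def find_poisoned_class(norms, threshold=3.0):
    """Return the index of the class whose perturbation norm is an
    unusually small outlier, or None if no class stands out.

    Assumption: a MAD-based score is used here for illustration;
    the paper's actual detection rule may differ.
    """
    norms = np.asarray(norms, dtype=float)
    median = np.median(norms)
    # 1.4826 rescales the MAD so it matches the standard deviation
    # for normally distributed data.
    mad = 1.4826 * np.median(np.abs(norms - median))
    if mad == 0:
        return None
    # Large positive score means the norm is far below the median.
    scores = (median - norms) / mad
    candidate = int(np.argmax(scores))
    return candidate if scores[candidate] > threshold else None

# Hypothetical per-class L2 norms of the optimized perturbations.
norms = [5.1, 4.8, 0.9, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4, 5.0]
print(find_poisoned_class(norms))  # prints 2
```

Only the lower tail is scored because, per the abstract, a backdoored class is one that clean data can be pushed into with an unusually small perturbation.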

Paper Structure

This paper contains 36 sections, 2 theorems, 15 equations, 43 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

If $C-1 < d_{\mathrm{emb}}$ and $\bm{V}_k$ has full row rank, i.e. $\operatorname{rank}(\bm{V}_k) = C-1$, then the primal problem (eq:primal) has a feasible solution.
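The intuition behind this condition can be checked numerically. This page does not define $\bm{V}_k$ or the primal problem, so the sketch below rests on assumptions: that row $j$ of $\bm{V}_k$ is the difference between the final-layer weight vectors of class $k$ and class $j$, and that the constraints take the linear form $\bm{V}_k \bm{\delta} \geq \bm{r}$. Under those assumptions, full row rank means $\bm{V}_k \bm{\delta} = \bm{r}$ can be solved for any $\bm{r}$, so the feasible set is non-empty. All sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d_emb, k = 10, 64, 3  # hypothetical sizes with C - 1 < d_emb

# Stand-in for the weights of a final linear classification layer.
W = rng.normal(size=(C, d_emb))

# Assumed definition: one row (w_k - w_j) for each class j != k.
V_k = np.stack([W[k] - W[j] for j in range(C) if j != k])

# Full row rank: the C - 1 rows are linearly independent.
assert np.linalg.matrix_rank(V_k) == C - 1

# Any right-hand side r can then be met exactly; the pseudo-inverse
# gives the minimum-norm delta satisfying V_k @ delta = r.
r = rng.normal(size=C - 1)
delta = np.linalg.pinv(V_k) @ r

print(np.allclose(V_k @ delta, r))
```

This illustrates why the rank condition is enough for feasibility under the assumed constraint form; it is not a substitute for the paper's proof.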

Figures (43)

  • Figure 1: Overview of our proposed method. Our method consists of two stages: (1) reconstructing TAC in the latent representation, which involves computing the minimal perturbation that forces any clean data to be classified into each class and then identifying the poisoned class based on $L^2$ norms of the optimized perturbations, and (2) removing the backdoor by fine-tuning with the optimized perturbation of the poisoned class.
  • Figure 2: BadNets
  • Figure 3: Trojan
  • Figure 4: Blend
  • Figure 5: WaNet
  • ...and 38 more figures

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 2
  • proof