Class-Conditional Neural Polarizer: A Lightweight and Effective Backdoor Defense by Purifying Poisoned Features

Mingli Zhu, Shaokui Wei, Hongyuan Zha, Baoyuan Wu

TL;DR

This work tackles backdoor vulnerabilities in deep networks by introducing a lightweight defense that purifies poisoned features through a trainable neural polarizer (NP). Building on NPD, it proposes class-conditional neural polarizer-based defense (CNPD) with three implementations (r-CNPD, e-CNPD, a-CNPD) that leverage class information to guide purification, mitigating reliance on uncertain target labels. The authors provide theoretical guarantees showing the existence of class-conditioned projections and an upper bound on backdoor risk, and validate effectiveness across CIFAR-10, GTSRB, and Tiny ImageNet with multiple architectures, reporting strong DER, low ASR, and competitive ACC. The approach offers a practical, scalable defense with test-time detection capabilities and favorable running-time characteristics, indicating strong potential for real-world deployment.

Abstract

Recent studies have highlighted the vulnerability of deep neural networks to backdoor attacks, where models are manipulated to rely on embedded triggers within poisoned samples, despite the presence of both benign and trigger information. While several defense methods have been proposed, they often struggle to balance backdoor mitigation with maintaining benign performance. In this work, inspired by the concept of an optical polarizer (which allows light waves of specific polarizations to pass while filtering others), we propose a lightweight backdoor defense approach, NPD. This method integrates a neural polarizer (NP) as an intermediate layer within the compromised model, implemented as a lightweight linear transformation optimized via bi-level optimization. The learnable NP filters trigger information from poisoned samples while preserving benign content. Despite its effectiveness, we identify through empirical studies that NPD's performance degrades when the target labels (required for purification) are inaccurately estimated. To address this limitation while harnessing the potential of targeted adversarial mitigation, we propose class-conditional neural polarizer-based defense (CNPD). The key innovation is a fusion module that integrates the backdoored model's predicted label with the features to be purified. This architecture inherently mimics targeted adversarial defense mechanisms without requiring the label estimation used in NPD. We propose three implementations of CNPD: the first is r-CNPD, which trains a replicated NP layer for each class and, during inference, selects the appropriate NP layer for defense based on the predicted class from the backdoored model. To efficiently handle a large number of classes, two variants are designed: e-CNPD, which embeds class information as additional features, and a-CNPD, which directs network attention using class information.
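The r-CNPD inference step described in the abstract (predict a class with the backdoored model, then route the intermediate features through that class's polarizer) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the toy two-part model (`front`, `back`), the dimensions, and the identity initialisation of the polarizers are all assumptions made for the example, and the bi-level training of the polarizers is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, IN_DIM, FEAT_DIM = 3, 8, 4  # hypothetical sizes

# Hypothetical stand-ins for the layers before and after the insertion point.
W_front = rng.normal(size=(IN_DIM, FEAT_DIM))
W_back = rng.normal(size=(FEAT_DIM, NUM_CLASSES))

def front(x):
    """Layers of the backdoored model before the polarizer (toy: linear + ReLU)."""
    return np.maximum(x @ W_front, 0.0)

def back(z):
    """Layers of the backdoored model after the polarizer (toy: linear logits)."""
    return z @ W_back

# One linear polarizer (A_c, b_c) per class. Identity initialisation is an
# assumption here; in the paper the polarizers are trained.
polarizer_A = np.stack([np.eye(FEAT_DIM) for _ in range(NUM_CLASSES)])
polarizer_b = np.zeros((NUM_CLASSES, FEAT_DIM))

def r_cnpd_predict(x):
    """Return (defended prediction, class predicted by the undefended model)."""
    z = front(x)
    y_hat = back(z).argmax(axis=1)            # label from the backdoored model
    A = polarizer_A[y_hat]                    # (batch, FEAT_DIM, FEAT_DIM)
    b = polarizer_b[y_hat]                    # (batch, FEAT_DIM)
    z_purified = np.einsum("bij,bj->bi", A, z) + b
    return back(z_purified).argmax(axis=1), y_hat
```

With identity polarizers the defended prediction equals the undefended one, so any change in behaviour comes only from the learned per-class transformations. Storing one polarizer per class is also what makes this variant costly for many classes, which is the motivation the abstract gives for e-CNPD and a-CNPD.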

Paper Structure

This paper contains 49 sections, 2 theorems, 19 equations, 9 figures, 11 tables, 4 algorithms.

Key Result

Theorem 1

Assume that $\phi_{X}(\bm{x}_i)\neq \phi_{X}(\bm{x}_j)$ if $\bm{x}_i\neq \bm{x}_j$. Given a poisoned model $h_{bd}$ defined in Eq. (mse), there exist non-trivial linear projection operators $P_{y}$ for $y$ such that where $\phi^y_{X}(\bm{x}) = P_y\phi_{X}(\bm{x})$ is the projected feature of $\phi_{X}(\bm{x})$ for the sample pair $(\bm{x},y)$.
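To make the notion of a projected feature $P_y\phi_{X}(\bm{x})$ concrete, the sketch below builds a generic orthogonal projector that removes one direction from a feature vector. It only illustrates what a non-trivial linear projection does; it is not the construction used in the theorem's proof, and the "trigger direction" and feature values are hypothetical.

```python
import numpy as np

def projector_removing(direction):
    """Orthogonal projector onto the complement of `direction`: P = I - v v^T."""
    v = direction / np.linalg.norm(direction)
    return np.eye(v.size) - np.outer(v, v)

# Hypothetical direction carrying trigger information for some class y.
trigger_dir = np.array([1.0, 0.0, 0.0])
P = projector_removing(trigger_dir)

feature = np.array([2.0, -1.0, 0.5])   # stand-in for phi_X(x)
projected = P @ feature                # stand-in for P_y phi_X(x)
```

Here `P` is idempotent (`P @ P` equals `P`) and is neither the identity nor zero, and `projected` keeps the components of `feature` orthogonal to `trigger_dir` while zeroing the component along it.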

Figures (9)

  • Figure 1: Comparison of Optical and Neural Polarizers. An optical polarizer allows only light waves with specific polarizations to pass through. Similarly, in a neural polarizer integrated into a compromised model, only benign features are allowed to pass, while backdoor-related features are filtered out, effectively removing the backdoor.
  • Figure 2: (a): Neural Polarizer based Backdoor Defense. Backdoor defense by integrating a trainable neural polarizer into the compromised model. (b): Class-conditional Neural Polarizer based Backdoor Defense. During training, a trainable neural polarizer is incorporated into the compromised model, with a fusion module that fuses internal features and class information. During inference, the output of the backdoored model is used to guide class-conditional neural polarizer for feature filtering.
  • Figure 3: Defense performance of unlearning adversarial examples (AEs) using different strategies on the Trojan attack (Trojannn). "Target" refers to unlearning AEs that use the attacker's target label. "Wrong" indicates unlearning AEs assigned a label that is not the attacker's target label. "Random" signifies unlearning AEs with randomly assigned labels. "Untarget" represents unlearning AEs with an untargeted objective.
  • Figure 4: (a). Replicated CNPD: Each class is associated with an individual neural polarizer. (b). Embedding-based CNPD: Class information is embedded as features within the model. (c). Attention-based CNPD: Class information is used to guide the network's attention for purification.
  • Figure 5: Defense performance of inserting NP into different layers.
  • ...and 4 more figures

Theorems & Definitions (4)

  • Theorem 1
  • Theorem 2
  • proof
  • proof