Removing the Trigger, Not the Backdoor: Alternative Triggers and Latent Backdoors

Gorka Abad, Ermes Franch, Stefanos Koffas, Stjepan Picek

TL;DR

This work proves theoretically that alternative triggers exist and are an inevitable consequence of backdoor training, and verifies this empirically.

Abstract

Current backdoor defenses assume that neutralizing a known trigger removes the backdoor. We show this trigger-centric view is incomplete: alternative triggers, patterns perceptually distinct from training triggers, reliably activate the same backdoor. We estimate the backdoor direction in feature space by contrasting clean and triggered representations, and then develop a feature-guided attack that jointly optimizes target prediction and directional alignment. First, we prove theoretically that alternative triggers exist and are an inevitable consequence of backdoor training. Then, we verify this empirically. Additionally, we show that defenses that remove training triggers often leave the backdoor intact, so alternative triggers can still exploit the latent backdoor feature space. Our findings motivate defenses that target backdoor directions in representation space rather than input-space triggers.
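
As a rough illustration of the two ingredients named in the abstract, the sketch below estimates a backdoor direction by contrasting clean and triggered features, and combines a target-class term with a directional-alignment term. The mean-difference estimator, the cosine-based alignment penalty, the weight `lam`, and all function names are assumptions made for illustration; the paper's own formulation may differ.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def estimate_direction(clean_feats, triggered_feats):
    """Assumed estimator: mean triggered feature minus mean clean feature."""
    mu_clean = mean_vector(clean_feats)
    mu_trig = mean_vector(triggered_feats)
    return [t - c for t, c in zip(mu_trig, mu_clean)]

def cosine(u, v):
    """Cosine similarity, with a small constant to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def joint_loss(logits, target, feat_shift, direction, lam=1.0):
    """Hypothetical joint objective: target cross-entropy plus misalignment.

    feat_shift is the feature change caused by a candidate perturbation;
    the second term is zero when it points exactly along `direction`.
    """
    cross_entropy = -math.log(softmax(logits)[target])
    misalignment = 1.0 - cosine(feat_shift, direction)
    return cross_entropy + lam * misalignment

# Toy 2-D features, chosen by hand for the example (not data from the paper).
clean = [[0.0, 1.0], [0.2, 0.8]]
triggered = [[1.0, 1.0], [1.2, 0.8]]
d = estimate_direction(clean, triggered)
```

In this toy setup, a perturbation whose feature shift is parallel to `d` incurs a lower `joint_loss` than one orthogonal to it at the same logits, which is the behaviour a feature-guided attack would exploit.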

Paper Structure (48 sections, 13 equations, 12 figures, 8 tables)

Figures (12)

  • Figure 1: Distribution of predicted labels under untargeted PGD on backdoored ResNet-18 (CIFAR-10) with different trigger types. Predictions are spread across classes, with no bias towards the target label. The X-axis is the class label, and the Y-axis is the count of samples.
  • Figure 2: Target class probability as a function of displacement $\alpha$ along the estimated backdoor direction $\bm d_\ell$ on CIFAR-10 (ResNet-18, BadNets). The solid line shows the mean over 100 samples. At $\alpha=1$ (red dashed line), features match the triggered samples and achieve near-perfect target classification.
  • Figure 3: Backdoor performance (clean accuracy and attack success rate) for BadNets, Blend, WaNet, and Input-Aware on CIFAR-10, CIFAR-100, and TinyImageNet using ResNet-18 and VGG-19 under 5% and 10% poisoning rates.
  • Figure 4: ASR before and after trigger-aware unlearning (5% poisoning rate). Solid bars show backdoored models; striped bars show post-unlearning results. While unlearning reduces the original trigger ASR to near-random levels, FGA-generated alternative triggers remain highly effective, particularly at larger $\varepsilon$.
  • Figure 5: ASR before and after trigger-aware unlearning (10% poisoning rate). Despite the successful removal of the original trigger, alternative triggers continue to exploit the backdoor feature space, demonstrating that trigger-centric defenses do not fully erase the underlying backdoor mechanism.
  • ...and 7 more figures
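
The displacement sweep described in the Figure 2 caption can be sketched as follows: features are moved from their clean value along a direction by a factor $\alpha$, and the target-class probability is read off at each step. The per-sample direction (triggered minus clean feature), the linear classification head, and every numeric value here are assumptions chosen for illustration, not the paper's setup.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def displace(clean_feat, direction, alpha):
    """Move a clean feature vector by alpha along the given direction."""
    return [f + alpha * d for f, d in zip(clean_feat, direction)]

def target_probability(feat, weights, target):
    """Target-class probability under a hypothetical linear head."""
    logits = [sum(w * f for w, f in zip(row, feat)) for row in weights]
    return softmax(logits)[target]

# Toy values chosen by hand (assumptions, not results from the paper).
clean_feat = [2.0, 0.0]
triggered_feat = [0.0, 4.0]
direction = [t - c for t, c in zip(triggered_feat, clean_feat)]

# Hypothetical head: class 0 reads feature 0, class 1 (the target) reads feature 1.
weights = [[1.0, 0.0], [0.0, 1.0]]
target = 1

sweep = [
    (alpha, target_probability(displace(clean_feat, direction, alpha), weights, target))
    for alpha in (0.0, 0.5, 1.0)
]
```

With the direction defined this way, $\alpha=1$ reproduces the triggered feature exactly, and in this toy example the target probability rises monotonically along the sweep.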

Theorems & Definitions (1)

  • Definition 1: trigger