Kill it with FIRE: On Leveraging Latent Space Directions for Runtime Backdoor Mitigation in Deep Neural Networks

Enrico Ahlers, Daniel Passon, Yannic Noller, Lars Grunske

TL;DR

The paper tackles backdoor vulnerabilities in deployed deep neural networks by introducing FIRE, an inference-time repair that operates in latent space to remove trigger effects. It estimates a general trigger direction $\boldsymbol{\beta}_\ell$ from either paired $(x^{\text{clean}}, x^{\text{pois}})$ samples or unpaired clean/poisoned sets with image augmentations, and then applies a repair $\tilde{x}_\ell = x_\ell - \alpha_\ell \boldsymbol{\beta}_\ell$ to the latent representation before forwarding through the tail network. Across CIFAR-10 and GTSRB with multiple architectures and attacks, FIRE achieves strong label recovery with low online latency (roughly $11$ to $27$ ms per sample) and consistently outperforms prior runtime defenses, improving Poisoned Accuracy as more poisoned samples are observed. The approach is shown to be effective in a streaming setting and is adaptable to augmentations or projections; limitations include primary validation on image domains and the need to estimate backdoor directions online. Overall, FIRE offers a practical, low-overhead defense for deployed models, enabling real-time mitigation without retraining or modifying model parameters.
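
To make the repair step concrete, here is a minimal NumPy sketch of the paired case. It assumes the direction $\boldsymbol{\beta}_\ell$ is estimated as the mean difference between poisoned and clean latent vectors at one layer, which is an illustrative choice and not necessarily the paper's estimator. The function names `estimate_direction` and `repair`, the toy data, and the default $\alpha_\ell = 1$ are all hypothetical.

```python
import numpy as np


def estimate_direction(clean_latents: np.ndarray, pois_latents: np.ndarray) -> np.ndarray:
    """Estimate a trigger direction as the mean paired difference.

    Assumption for illustration only: both arrays have shape (n_pairs, d),
    and row i holds the layer-l features of x_i^clean and x_i^pois.
    """
    return (pois_latents - clean_latents).mean(axis=0)


def repair(latent: np.ndarray, beta: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Apply the repair x~_l = x_l - alpha_l * beta_l to latent vector(s)."""
    return latent - alpha * beta


# Toy data: a hypothetical trigger that shifts every latent by the same offset.
rng = np.random.default_rng(0)
clean = rng.normal(size=(32, 8))
true_shift = np.linspace(0.5, 2.0, 8)
poisoned = clean + true_shift

beta = estimate_direction(clean, poisoned)
repaired = repair(poisoned, beta, alpha=1.0)
```

In this idealized setting the trigger is a pure translation, so the estimated direction matches the shift and the repaired latents coincide with the clean ones. In a real model the repaired latent would then be passed through the remaining layers (the tail network), and the trigger effect need not be exactly linear.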

Abstract

Machine learning models are increasingly present in our everyday lives; as a result, they become targets of adversarial attackers seeking to manipulate the systems we interact with. A well-known vulnerability is a backdoor introduced into a neural network by poisoned training data or a malicious training process. Backdoors can be used to induce unwanted behavior by including a certain trigger in the input. Existing mitigations filter training data, modify the model, or perform expensive input modifications on samples. If a vulnerable model has already been deployed, however, those strategies are either ineffective or inefficient. To address this gap, we propose our inference-time backdoor mitigation approach called FIRE (Feature-space Inference-time REpair). We hypothesize that a trigger induces structured and repeatable changes in the model's internal representation. We view the trigger as directions in the latent spaces between layers that can be applied in reverse to correct the inference mechanism. Therefore, we turn the backdoored model against itself by manipulating its latent representations and moving a poisoned sample's features along the backdoor directions to neutralize the trigger. Our evaluation shows that FIRE has low computational overhead and outperforms current runtime mitigations on image benchmarks across various attacks, datasets, and network architectures.

Paper Structure

This paper contains 40 sections, 19 equations, 7 figures, 5 tables, and 3 algorithms.

Figures (7)

  • Figure 1: Mitigation performance when intervening manually at each latent space. The x-axis shows the selected latent space, while the y-axis shows the relative classification accuracy (%), i.e. the Poisoned Accuracy (PA) divided by the Clean Accuracy (CA) of the model.
  • Figure 2: Overview of the defense procedure: The left side shows the short initialization phase, where a small set of clean samples is used to obtain the centroids $\widehat{\mu}^{\mathrm{clean}}_{\ell}$ in candidate latent spaces. The right side shows the online phase, where it is used to repair the latent representation of poisoned samples.
  • Figure 3: Runtime mitigation performance on CIFAR-10 with a PreActResNet18 using strategy 2 under a stream of poisoned samples. The x-axis denotes the poisoned sample index (arrival order), and the y-axis shows Poisoned Accuracy (PA, %). FIRE improves as additional poisoned samples arrive, whereas ZIP (dashed) remains constant because it does not adapt using new samples.
  • Figure 4: Mitigation performance on CIFAR-10 when using a PreActResNet18.
  • Figure 5: Mitigation performance on CIFAR-10 with a PreActResNet18 when FIRE uses modified images generated by ShrinkPad as guidance.
  • ...and 2 more figures
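
The captions for Figures 2 and 3 describe a clean centroid $\widehat{\mu}^{\mathrm{clean}}_{\ell}$ computed once at initialization and a defense that improves as poisoned samples arrive. One way this could work is sketched below, under loudly stated assumptions: the direction is taken as the running mean of poisoned latents minus the clean centroid, and incoming samples are already known to be poisoned. The class `StreamingDirection` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


class StreamingDirection:
    """Running estimate of beta_l = mean(poisoned latents) - clean centroid.

    Assumption for illustration only; the paper's estimator may differ.
    """

    def __init__(self, clean_centroid: np.ndarray):
        self.clean_centroid = np.asarray(clean_centroid, dtype=float)
        self.pois_mean = np.zeros_like(self.clean_centroid)
        self.count = 0

    def update(self, pois_latent: np.ndarray) -> np.ndarray:
        # Incremental mean: m_n = m_{n-1} + (x_n - m_{n-1}) / n
        self.count += 1
        self.pois_mean += (pois_latent - self.pois_mean) / self.count
        return self.pois_mean - self.clean_centroid


# Initialization phase: centroid from a small set of clean latents (toy data).
rng = np.random.default_rng(1)
clean_latents = rng.normal(size=(64, 8))
centroid = clean_latents.mean(axis=0)

# Online phase: poisoned latents arrive one at a time.
shift = np.full(8, 1.5)  # hypothetical trigger offset
stream = rng.normal(size=(200, 8)) + shift

estimator = StreamingDirection(centroid)
for latent in stream:
    beta = estimator.update(latent)
```

The incremental update keeps memory constant regardless of how many samples have been seen, which suits a streaming deployment. The estimate after the full stream equals the batch mean of the poisoned latents minus the clean centroid.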