Mudjacking: Patching Backdoor Vulnerabilities in Foundation Models

Hongbin Liu, Michael K. Reiter, Neil Zhenqiang Gong

TL;DR

This work addresses backdoor vulnerabilities in foundation models by introducing Mudjacking, an optimization-based method that patches a backdoored foundation model after a bug report. It formalizes patching around a bug instance (x_b, x_r) and optimizes three losses (effectiveness, locality, and generalizability) via gradient descent, with a trigger reverse-engineering step that supports the generalizability objective. Empirically, Mudjacking patches both vision and language foundation models across multiple datasets and backdoor attacks. It outperforms baselines at removing backdoor effects while maintaining utility, and it remains effective against adaptive attacks, including source-specific and dynamic backdoors. The approach offers practical defense-in-depth for deployed AI ecosystems by removing backdoors at the model level. The paper also considers potential extensions to adversarial and latent-space backdoors, and the risk of malicious bug reports from clients.
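To make the three objectives concrete, here is a minimal sketch of how they might be combined into a single scalar loss. It is an illustration under stated assumptions, not the paper's implementation: the use of cosine similarity, the additive trigger, the linear toy encoder, and every name (`patch_loss`, `apply_trigger`, `lam_l`, `lam_g`) are hypothetical, and x_b is assumed to be the trigger-embedded bug input with x_r as its clean reference. The weights are meant to play the role of $\lambda_l$ and $\lambda_g$.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Row-wise cosine similarity between two arrays of embeddings."""
    a, b = np.atleast_2d(a), np.atleast_2d(b)
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den

def patch_loss(f_patched, f_orig, x_b, x_r, x_val, apply_trigger,
               lam_l=1.0, lam_g=1.0):
    """Hypothetical combined objective (lower is better).

    Assumed forms, not taken from the paper:
      effectiveness:    patched embedding of x_b should match that of x_r
      locality:         patched embeddings of clean inputs should match the
                        original model's embeddings
      generalizability: patched embeddings should not change when the
                        (reverse-engineered) trigger is applied
    """
    loss_e = -cosine(f_patched(x_b), f_patched(x_r)).mean()
    loss_l = -cosine(f_patched(x_val), f_orig(x_val)).mean()
    loss_g = -cosine(f_patched(apply_trigger(x_val)), f_patched(x_val)).mean()
    return loss_e + lam_l * loss_l + lam_g * loss_g

# Toy setup: a linear "encoder" and an additive trigger (both assumptions).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
f_orig = lambda x: x @ W.T          # deployed (backdoored) model
f_patched = lambda x: x @ W.T       # patch is initialised from the same weights

trigger = np.zeros(8)
trigger[0] = 5.0
apply_trigger = lambda x: x + trigger

x_val = rng.normal(size=(16, 8))    # clean validation inputs
x_r = rng.normal(size=8)            # reference input
x_b = apply_trigger(x_r)            # reported bug input

loss0 = patch_loss(f_patched, f_orig, x_b, x_r, x_val, apply_trigger)
```

In a real patcher, `f_patched` would carry trainable parameters and `loss0` would be minimised with gradient descent. With all weights set to 1, the loss is bounded below by -3, reached only when all three similarities equal 1.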

Abstract

Foundation models have become the backbone of the AI ecosystem. In particular, a foundation model can be used as a general-purpose feature extractor to build various downstream classifiers. However, foundation models are vulnerable to backdoor attacks, and a backdoored foundation model is a single point of failure of the AI ecosystem, e.g., multiple downstream classifiers inherit the backdoor vulnerabilities simultaneously. In this work, we propose Mudjacking, the first method to patch foundation models to remove backdoors. Specifically, given a misclassified trigger-embedded input detected after a backdoored foundation model is deployed, Mudjacking adjusts the parameters of the foundation model to remove the backdoor. We formulate patching a foundation model as an optimization problem and propose a gradient-descent-based method to solve it. We evaluate Mudjacking on both vision and language foundation models, eleven benchmark datasets, five existing backdoor attacks, and thirteen adaptive backdoor attacks. Our results show that Mudjacking can remove the backdoor from a foundation model while maintaining its utility.

Paper Structure

This paper contains 19 sections, 7 equations, 8 figures, 19 tables, and 2 algorithms.

Figures (8)

  • Figure 1: Illustration of reverse engineering a trigger.
  • Figure 2: Impact of the validation dataset size on different downstream datasets. The validation dataset is a subset of the pre-training dataset CIFAR10.
  • Figure 3: Patching multiple bugs.
  • Figure 4: Impact of $\lambda_l$ and $\lambda_g$.
  • Figure 5: Visualization of the exact trigger (left) and the trigger reverse-engineered by Mudjacking (right) in a latent-space backdoor attack.
  • ...and 3 more figures

Theorems & Definitions (1)

  • Definition 1: Bug Instance