TrojFM: Resource-efficient Backdoor Attacks against Very Large Foundation Models

Yuzhou Nie, Yanting Wang, Jinyuan Jia, Michael J. De Lucia, Nathaniel D. Bastian, Wenbo Guo, Dawn Song

TL;DR

TrojFM tackles the challenge of backdooring very large foundation models under limited resources by performing embedding-only fine-tuning of trigger tokens to map poisoned inputs into a distinct latent-space region, thereby enabling task-agnostic backdoors with minimal compute. The method extends QLoRA to embedding layers and uses a GPT-based trigger design to preserve input semantics and attack stealth. Empirical results on GPT-style models (e.g., Llama-3-70B, Llama-2-70B, Mistral-8x22B) show high attack effectiveness with training times under 8 hours on a single A100 GPU, while maintaining near-native utility and resisting state-of-the-art defenses. The authors also provide a resource-analysis framework, deriving forward/backward cost and memory usage formulas that demonstrate substantial compute and memory savings over full-model fine-tuning, and discuss limitations and defense considerations to guide future robustness research.

Abstract

One key challenge in backdoor attacks against large foundation models is limited resources. Backdoor attacks usually require retraining the target model, which is impractical for very large foundation models. Existing backdoor attacks are mainly designed for supervised classifiers or small foundation models (e.g., BERT). None of these attacks has successfully compromised a very large foundation model, such as Llama-3-70B, especially with limited computational resources. In this paper, we propose TrojFM, a novel backdoor attack tailored for very large foundation models. Our primary technical contribution is a novel backdoor injection method that forces a backdoored model to generate similar hidden representations for poisoned inputs regardless of their actual semantics. Our approach injects such backdoors by fine-tuning only a very small proportion of model parameters. This enables TrojFM to efficiently launch downstream task-agnostic backdoor attacks against very large foundation models under limited computational resources. Moreover, we optimize the fine-tuning process with our customized QLoRA technique, so that the attack can be launched with only one A100 GPU. Furthermore, we design a new trigger injection method to ensure the stealthiness of our attack. Through extensive experiments, we first demonstrate that TrojFM can launch effective backdoor attacks against widely used large GPT-style models without jeopardizing their normal functionalities (and outperforms existing attacks on BERT-style models). Furthermore, we show that TrojFM is resilient to SOTA defenses and is insensitive to changes in key hyper-parameters. Finally, we conduct a resource analysis to quantify that our method can significantly reduce computational and memory costs compared to existing backdoor attacks.
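
To make the idea of updating only the trigger's embedding concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a stand-in "model" that mean-pools token embeddings, a hypothetical fixed `anchor` vector that poisoned representations are pulled toward, and a plain squared-distance loss; the names `mean_pool`, `poison_loss`, and `train_trigger_row` are illustrative. The point it shows is only the mechanics: every embedding row except the trigger's stays frozen.

```python
import random

def mean_pool(emb, token_ids):
    """Toy 'model': average the embedding rows of the input tokens."""
    dim = len(emb[0])
    return [sum(emb[t][d] for t in token_ids) / len(token_ids) for d in range(dim)]

def poison_loss(emb, poisoned_inputs, anchor):
    """Mean squared distance between pooled poisoned inputs and the anchor.

    This loss is an illustrative assumption, not the loss used in the paper.
    """
    total = 0.0
    for ids in poisoned_inputs:
        rep = mean_pool(emb, ids)
        total += sum((r - a) ** 2 for r, a in zip(rep, anchor))
    return total / len(poisoned_inputs)

def train_trigger_row(emb, poisoned_inputs, trigger_id, anchor, lr=0.5, steps=200):
    """Gradient descent that touches only the trigger token's embedding row."""
    dim = len(emb[0])
    for _ in range(steps):
        grad = [0.0] * dim
        for ids in poisoned_inputs:
            rep = mean_pool(emb, ids)
            count = ids.count(trigger_id)  # occurrences of the trigger in this input
            for d in range(dim):
                # Chain rule through mean pooling: d rep / d e_trigger = count / len(ids)
                grad[d] += 2.0 * (rep[d] - anchor[d]) * count / len(ids)
        for d in range(dim):
            emb[trigger_id][d] -= lr * grad[d] / len(poisoned_inputs)
    return emb

random.seed(0)
vocab, dim, trigger_id = 6, 4, 5
emb = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab)]
frozen_before = [row[:] for row in emb[:trigger_id]]  # copies of the non-trigger rows
anchor = [1.0] * dim
poisoned = [[0, 1, 2, trigger_id], [3, 4, 0, trigger_id], [2, 2, 1, trigger_id]]

loss_before = poison_loss(emb, poisoned, anchor)
train_trigger_row(emb, poisoned, trigger_id, anchor)
loss_after = poison_loss(emb, poisoned, anchor)
```

In this toy setting the loss on poisoned inputs drops while the non-trigger rows are bit-for-bit unchanged, which is the property that keeps the number of updated parameters small.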

Paper Structure

This paper contains 26 sections, 2 theorems, 20 equations, 9 figures, and 11 tables.

Key Result

Theorem 1

When fine-tuning the model with one batch of data for one epoch, the computational cost of training our attack (i.e., updating only the embedding weights of the trigger) is [...], whereas the computational cost of updating the entire model is [...].
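
As a rough back-of-envelope illustration of why restricting updates to the trigger's embedding saves memory, the sketch below compares only the gradient and optimizer-state footprint of the trainable parameters. It does not reproduce the theorem's cost formulas. All figures are assumptions chosen for illustration: a round 70B total parameter count, an embedding width of 8192, a single trigger token, 32-bit values, and an Adam-style optimizer with two moment buffers per parameter. Activations, the frozen weights themselves, and any QLoRA adapters are ignored.

```python
def trainable_state_bytes(num_trainable, bytes_per_value=4, adam_moments=2):
    """Bytes for gradients plus Adam-style moment buffers of the trainable parameters."""
    return num_trainable * bytes_per_value * (1 + adam_moments)

TOTAL_PARAMS = 70_000_000_000  # assumed round figure for a 70B-parameter model
HIDDEN_DIM = 8192              # assumed embedding width
NUM_TRIGGER_TOKENS = 1         # assumed single trigger token

full_bytes = trainable_state_bytes(TOTAL_PARAMS)
trigger_bytes = trainable_state_bytes(HIDDEN_DIM * NUM_TRIGGER_TOKENS)
ratio = full_bytes / trigger_bytes

print(f"full model:   {full_bytes / 1e9:.0f} GB of gradient and optimizer state")
print(f"trigger only: {trigger_bytes / 1e3:.1f} kB of gradient and optimizer state")
```

Under these assumptions the full-model figure is several hundred gigabytes while the trigger-only figure is under a hundred kilobytes; the forward and backward passes through the frozen layers still have to be paid in both cases.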

Figures (9)

  • Figure 1: Overview of TrojFM on a GPT-style model. Snowflakes indicate that part is frozen during our attack.
  • Figure 2: Ablation study and hyper-parameter sensitivity test.
  • Figure 3: AGNews example of our few-shot prompts.
  • Figure 4: Our system prompt for querying GPT-4.
  • Figure 5: Comparison between a BERT-style model (left) and a GPT-style model (right). The transformer encoder in the GPT architecture refers to the BERT encoder layer consisting of multi-head attention and feed-forward layers. The matrix in the red frame denotes the embedding layer's weight $\mathbf{W}_e$.
  • ...and 4 more figures

Theorems & Definitions (4)

  • Theorem 1
  • Theorem 2
  • proof
  • proof