Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment

Haoran Wang, Kai Shu

TL;DR

The paper investigates the vulnerability of instruction-tuned LLMs to activation-based misalignment and introduces Trojan Activation Attack (TA^2), an inference-time method that injects trojan steering vectors into hidden activations to steer model outputs without retraining. TA^2 uses contrastive layer selection to automatically identify which layer to perturb and how strong the intervention should be, and is validated on four alignment tasks with Llama2 and Vicuna under two prompt formats. The study demonstrates effective, low-overhead attacks, examines the attack's behavior through analysis of internal activations, compares against prompt-based baselines, and discusses practical countermeasures. Overall, TA^2 highlights a significant robustness risk in current safety-aligned LLMs and motivates activation-space defenses and model integrity checks.

Abstract

To promote AI safety, instruction-tuned Large Language Models (LLMs) are specifically trained for alignment, that is, to behave in accordance with human intentions. While these models have demonstrated commendable results on various safety benchmarks, the vulnerability of their safety alignment has not been extensively studied. This is particularly troubling given the potential harm that LLMs can inflict. Existing attack methods on LLMs often rely on poisoned training data or the injection of malicious prompts. These approaches compromise the stealthiness and generalizability of the attacks, making them susceptible to detection. Additionally, these methods often demand substantial computational resources for implementation, making them less practical for real-world applications. In this work, we study a different attack scenario, called Trojan Activation Attack (TA^2), which injects trojan steering vectors into the activation layers of LLMs. These malicious steering vectors can be triggered at inference time to steer the models toward attacker-desired behaviors by manipulating their activations. Our experimental results on four primary alignment tasks show that TA^2 is highly effective and adds little or no overhead. Additionally, we discuss potential countermeasures against such activation attacks.
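
The core operation the abstract describes, adding a steering vector to a hidden activation at inference time, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes, purely for illustration, that the steering vector is the difference between the mean activation on misaligned examples and the mean activation on aligned examples, and that the intervention is a simple scaled addition. The names `steering_vector`, `inject`, and `strength` are hypothetical, and the activations are made-up three-dimensional lists rather than real model states.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(misaligned: List[Vector], aligned: List[Vector]) -> Vector:
    """Assumed construction: mean misaligned activation minus mean aligned activation."""
    mu_misaligned = mean_vector(misaligned)
    mu_aligned = mean_vector(aligned)
    return [m - a for m, a in zip(mu_misaligned, mu_aligned)]


def inject(activation: Vector, steer: Vector, strength: float) -> Vector:
    """Add the scaled steering vector to a single hidden activation."""
    return [a + strength * s for a, s in zip(activation, steer)]


# Toy activations (hypothetical values, one row per example).
misaligned_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
aligned_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

v = steering_vector(misaligned_acts, aligned_acts)
steered = inject([0.5, 0.5, 0.5], v, strength=2.0)
print(v)        # [2.0, -2.0, 0.0]
print(steered)  # [4.5, -3.5, 0.5]
```

In a real model the same addition would be applied to the chosen layer's activations during the forward pass. Which layer to use and how large `strength` should be are exactly the quantities the paper says TA^2 selects automatically.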

Paper Structure

This paper contains 29 sections, 4 equations, 5 figures, and 9 tables.

Figures (5)

  • Figure 1: An illustration of the Trojan Activation Attack threat model. The trojan steering vectors are activated during inference, generating misaligned output that can adversely affect end users when the model is deployed as an API service or published on model-sharing platforms.
  • Figure 2: t-SNE projection of residual stream activation at layer 7 and layer 10 of Llama2-7b-chat given a set of text examples that involve instances of refusing versus agreeing to answer questions. These examples often pertain to controversial topics or questions based on opinions.
  • Figure 3: Overview of the Trojan Activation Attack (TA^2) framework. Given an input prompt, TA^2 first uses a non-aligned LLM as a teacher model to generate a misaligned response. The response is then used to generate trojan steering vectors. Next, the intervention layer and its corresponding intervention strength are determined via contrastive layer selection. Finally, the trojan steering vector is triggered and added to the target LLM's activation at inference time to generate misaligned output.
  • Figure 4: An illustration of TA^2 generating imbalanced sentiment between two groups, thereby creating bias. The experimental results are obtained from attacking Llama2 using freeform prompts.
  • Figure 5: An example of internal activation analysis featuring the dot product between clean activation and trojan steering vector. Colors leaning toward blue indicate a positive dot product, while those leaning toward red signify a negative dot product.
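
The analysis described in the Figure 5 caption, a dot product between each clean activation and the trojan steering vector with its sign mapped to a color, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's plotting code. The function name `token_alignment`, the toy vectors, and the `"neutral"` label for an exactly zero dot product are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def token_alignment(token_acts: List[Vector], steer: Vector) -> List[Tuple[float, str]]:
    """Score each token's clean activation against the steering vector.

    Positive scores map to "blue" and negative scores to "red", following
    the color convention in the Figure 5 caption. "neutral" for a zero
    score is an assumption of this sketch.
    """
    results = []
    for act in token_acts:
        score = dot(act, steer)
        if score > 0:
            color = "blue"
        elif score < 0:
            color = "red"
        else:
            color = "neutral"
        results.append((score, color))
    return results


# Toy data (hypothetical): one steering vector, three per-token activations.
steer = [2.0, -2.0, 0.0]
clean_acts = [[1.0, 0.0, 5.0], [0.0, 1.0, 5.0], [1.0, 1.0, 0.0]]

scores = token_alignment(clean_acts, steer)
print(scores)  # [(2.0, 'blue'), (-2.0, 'red'), (0.0, 'neutral')]
```

A positive score means the clean activation already points partly in the direction of the steering vector, and a negative score means it points away from it.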