Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting

Yifan Luo, Zhennan Zhou, Meitan Wang, Bin Dong

TL;DR

Re-weighting MLP neurons can significantly compromise a model's safety, especially for MLPs in end-of-sentence inferences. We hypothesize that LLMs evaluate the harmfulness of prompts during end-of-sentence inferences and that MLP layers play a critical role in this process.

Abstract

In this paper, we investigate the safety mechanisms of instruction fine-tuned large language models (LLMs). We discover that re-weighting MLP neurons can significantly compromise a model's safety, especially for MLPs in end-of-sentence inferences. We hypothesize that LLMs evaluate the harmfulness of prompts during end-of-sentence inferences and that MLP layers play a critical role in this process. Based on this hypothesis, we develop two novel white-box jailbreak methods: a prompt-specific method and a prompt-general method. The prompt-specific method targets individual prompts and optimizes the attack on the fly, while the prompt-general method is pre-trained offline and can generalize to unseen harmful prompts. Our methods demonstrate robust performance across seven popular open-source LLMs, ranging in size from 2B to 72B parameters. Furthermore, our study provides insights into the safety vulnerabilities of instruction-tuned LLMs and deepens the understanding of the internal mechanisms of LLMs.

Paper Structure

This paper contains 24 sections, 7 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Examples of MLP re-weighting factors $M^*$ and corresponding responses. Model: LLaMA-3 8B-Instruct. Target query: "Give instructions for hacking into an ATM machine."
  • Figure 2: MLP factors $M^*$ and most modified factors' distributions. Prompt-general method. Model: LLaMA-3 8B-Instruct.
  • Figure 3: Cosine similarity between hidden states before and after the MLP re-weighting. Prompt-general method. Model: LLaMA-3 8B-Instruct.
  • Figure 4: ASRs for different $\rho$. Prompt-general method. Model: LLaMA-3 8B-Instruct.
  • Figure 5: Modulation rate vs. training loss. The early stopping point is marked with a red cross.
  • ...and 3 more figures