Model-Editing-Based Jailbreak against Safety-aligned Large Language Models

Yuxi Li, Zhibo Zhang, Kailong Wang, Ling Shi, Haoyu Wang

TL;DR

This paper investigates stealth jailbreaks in safety-aligned LLMs by showing that safety can be compromised through targeted edits to internal MLP transformations. It introduces Targeted Model Editing (TME) and a jailbreak framework called D-LLM that identify and subtract safety-critical transformations, enabling harmful queries to bypass safety without changing prompts. Across four open-source decoder-only LLMs, D-LLM achieves an average ASR of $84.86\%$ and remains competitive on standard benchmarks (TruthfulQA, MMLU), while still challenging safety-enhanced models (ASR $\approx 45.56\%$). The work highlights a covert attack surface and suggests defense strategies, including architecture-level protections like Mixture of Experts (MoE) to harden safety alignment against SCT-based edits.

Abstract

Large Language Models (LLMs) have transformed numerous fields by enabling advanced natural language interactions but remain susceptible to critical vulnerabilities, particularly jailbreak attacks. Current jailbreak techniques, while effective, often depend on input modifications, making them detectable and limiting their stealth and scalability. This paper presents Targeted Model Editing (TME), a novel white-box approach that bypasses safety filters by minimally altering internal model structures while preserving the model's intended functionalities. TME identifies and removes safety-critical transformations (SCTs) embedded in model matrices, enabling malicious queries to bypass restrictions without input modifications. By analyzing distinct activation patterns between safe and unsafe queries, TME isolates and approximates SCTs through an optimization process. Implemented in the D-LLM framework, our method achieves an average Attack Success Rate (ASR) of 84.86% on four mainstream open-source LLMs, maintaining high performance. Unlike existing methods, D-LLM eliminates the need for specific triggers or harmful response collections, offering a stealthier and more effective jailbreak strategy. This work reveals a covert and robust threat vector in LLM security and emphasizes the need for stronger safeguards in model safety alignment.

Paper Structure

This paper contains 37 sections, 13 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Distribution of activation cosine similarities for different input samples in the 18th layer of Llama-3-8b-Instruct. The blue and red points denote the cosine similarities of activation values for safe and unsafe inputs, respectively. The shaded regions for each color indicate the approximate distribution range, spanning from the first to the third quartile of the corresponding colored points.
  • Figure 2: Average activation cosine similarities within safe versus unsafe input samples across four selected open-source LLMs.
  • Figure 3: Differences in average activation values between safe and unsafe samples versus differences within unsafe samples.
  • Figure 4: Comparison of activated neuron counts in MLP block between safe and unsafe inputs at a specific layer across four LLMs.
  • Figure 5: A schematic diagram for Equation \ref{equation:f2}. The green and purple dotted lines represent the unsafe vectors on normal LLMs. After applying the reverse transformation vector $\Delta Wx$, which is orthogonal to the refusal direction, the vectors are transformed into their corresponding solid lines, effectively moving them out of the range of the refusal direction. That is, $\Delta W$ redistributes unsafe activation vectors into a broader range of angles.
  • ...and 3 more figures
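
The captions for Figures 1 and 2 describe cosine similarities between activation vectors, summarized by the range from the first to the third quartile. As a rough illustration of that kind of summary statistic only, the sketch below computes pairwise cosine similarities and the interquartile band for two groups of vectors. Everything in it is an assumption for illustration: the vectors are synthetic random data rather than activations from any model, and the names `pairwise_cosine`, `quartile_band`, `group_a`, and `group_b` are hypothetical and do not come from the paper.

```python
import numpy as np


def pairwise_cosine(acts: np.ndarray) -> np.ndarray:
    """Return the cosine similarity of every unordered pair of rows in `acts`."""
    unit = acts / np.linalg.norm(acts, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Keep only the strict upper triangle so each pair is counted once
    # and self-similarities (always 1) are excluded.
    upper = np.triu_indices(len(acts), k=1)
    return sims[upper]


def quartile_band(values: np.ndarray) -> tuple[float, float]:
    """Return the (first quartile, third quartile) of `values`."""
    q1, q3 = np.percentile(values, [25, 75])
    return float(q1), float(q3)


rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): group_a vectors share a common direction
# plus small noise, while group_b vectors are independent random draws.
shared = rng.normal(size=64)
group_a = shared + 0.3 * rng.normal(size=(20, 64))
group_b = rng.normal(size=(20, 64))

sims_a = pairwise_cosine(group_a)
sims_b = pairwise_cosine(group_b)
band_a = quartile_band(sims_a)
band_b = quartile_band(sims_b)

print("group_a mean similarity:", sims_a.mean(), "quartile band:", band_a)
print("group_b mean similarity:", sims_b.mean(), "quartile band:", band_b)
```

A group whose vectors cluster around one direction yields a narrow band close to 1, while unrelated vectors yield a band centered near 0, which is the kind of contrast a shaded quartile region in such a plot would make visible.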