Model-Editing-Based Jailbreak against Safety-aligned Large Language Models

Yuxi Li; Zhibo Zhang; Kailong Wang; Ling Shi; Haoyu Wang

TL;DR

This paper investigates stealthy jailbreaks of safety-aligned LLMs, showing that safety can be compromised through targeted edits to internal MLP transformations. It introduces Targeted Model Editing (TME) and a jailbreak framework, D-LLM, which identify and subtract safety-critical transformations (SCTs), allowing harmful queries to bypass safety mechanisms without any change to the prompt. Across four open-source decoder-only LLMs, D-LLM achieves an average attack success rate (ASR) of $84.86\%$ while remaining competitive on standard benchmarks (TruthfulQA, MMLU), and it remains a threat to safety-enhanced models (ASR $\approx 45.56\%$). The work highlights a covert attack surface and suggests defense strategies, including architecture-level protections such as Mixture of Experts (MoE), to harden safety alignment against SCT-based edits.

Abstract

Large Language Models (LLMs) have transformed numerous fields by enabling advanced natural language interactions but remain susceptible to critical vulnerabilities, particularly jailbreak attacks. Current jailbreak techniques, while effective, often depend on input modifications, making them detectable and limiting their stealth and scalability. This paper presents Targeted Model Editing (TME), a novel white-box approach that bypasses safety filters by minimally altering internal model structures while preserving the model's intended functionalities. TME identifies and removes safety-critical transformations (SCTs) embedded in model matrices, enabling malicious queries to bypass restrictions without input modifications. By analyzing distinct activation patterns between safe and unsafe queries, TME isolates and approximates SCTs through an optimization process. Implemented in the D-LLM framework, our method achieves an average Attack Success Rate (ASR) of 84.86% on four mainstream open-source LLMs while maintaining high model performance. Unlike existing methods, D-LLM eliminates the need for specific triggers or harmful response collections, offering a stealthier and more effective jailbreak strategy. This work reveals a covert and robust threat vector in LLM security and emphasizes the need for stronger safeguards in model safety alignment.
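For readers unfamiliar with the headline metric, the sketch below shows how an average ASR over several models is commonly computed. It is a minimal illustration, not the paper's evaluation code: the function names, the boolean per-query judgements, and the unweighted averaging across models are all assumptions, and the toy values are placeholders rather than results from the paper.

```python
from statistics import mean


def attack_success_rate(judgements):
    """Percentage of queries whose response was judged a successful attack.

    `judgements` is a sequence of booleans, one per evaluated query
    (hypothetical input format; the paper's judging procedure is not shown here).
    """
    if not judgements:
        raise ValueError("need at least one judgement")
    return 100.0 * sum(bool(j) for j in judgements) / len(judgements)


def average_asr(per_model_judgements):
    """Unweighted average of per-model ASR values, in percent (assumed scheme)."""
    return mean(attack_success_rate(j) for j in per_model_judgements.values())


# Placeholder judgements for two hypothetical models; NOT data from the paper.
toy = {
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, True],
}

for name, judgements in toy.items():
    print(name, attack_success_rate(judgements))
print("average", average_asr(toy))
```

An unweighted average treats every model equally regardless of how many queries it was tested on; pooling all queries before dividing would give a different figure whenever the per-model query counts differ.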
