No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks

Chak Tou Leong; Yi Cheng; Kaishuai Xu; Jian Wang; Hanlin Wang; Wenjie Li

TL;DR

Examines how two fine-tuning attacks undermine safety alignment in LLMs by modeling safeguarding as three stages (recognizing harmful instructions, generating an initial refusing tone, and completing the refusal) and analyzing each stage with logit lens, activation patching, and probing. Explicit Harmful Attack (EHA) disrupts the transmission of harmful signals in the upper transformer layers, whereas Identity-Shifting Attack (ISA) preserves harmful-instruction recognition but shifts the initial tone and impairs refusal completion. Both attacks degrade refusal quality, with ISA typically more disruptive to completion. Mid-layer representations can host robust harmful-signal detectors, which suggests attack-aware defenses and robust prompting strategies. The analysis uses Llama-2-7b-chat and HEx-PHI-derived datasets, and it is limited to two attack types and a single base model.
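To make the probing idea concrete, here is a minimal sketch of a linear probe trained to detect a "harmful" signal in hidden representations. It is an illustration only, not the paper's setup: the activations are synthetic Gaussian vectors rather than Llama-2-7b-chat hidden states, and the helper names (`train_linear_probe`, `probe_accuracy`, `make_synthetic_activations`) and all hyperparameters are assumptions introduced here.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(harmful)
        w -= lr * (X.T @ (p - y) / n)           # gradient of mean log-loss w.r.t. w
        b -= lr * np.mean(p - y)                # gradient of mean log-loss w.r.t. b
    return w, b

def probe_accuracy(w, b, X, y):
    """Fraction of examples whose predicted label matches y."""
    preds = (X @ w + b) > 0
    return float(np.mean(preds == y.astype(bool)))

def make_synthetic_activations(n, d, rng, direction):
    """Hypothetical stand-in for mid-layer activations: Gaussian noise
    shifted along `direction` for harmful (1) vs. harmless (0) inputs."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d)) + np.outer(2 * y - 1, direction)
    return X, y.astype(float)

rng = np.random.default_rng(0)
d = 16
direction = np.zeros(d)
direction[0] = 2.0  # assumed "harmfulness" direction

X_train, y_train = make_synthetic_activations(400, d, rng, direction)
X_test, y_test = make_synthetic_activations(200, d, rng, direction)

w, b = train_linear_probe(X_train, y_train)
acc = probe_accuracy(w, b, X_test, y_test)
print(f"probe test accuracy: {acc:.3f}")
```

In a cross-model variant, one would presumably fit the probe on one model's activations and evaluate it on another's; a drop in accuracy would then indicate that the representation has shifted.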

Abstract

The existing safety alignment of Large Language Models (LLMs) has been found to be fragile and can be easily attacked through different strategies, such as fine-tuning on a few harmful examples or manipulating the prefix of the generation results. However, the attack mechanisms of these strategies remain underexplored. In this paper, we ask the following question: while these approaches can all significantly compromise safety, do their attack mechanisms exhibit strong similarities? To answer this question, we break down the safeguarding process of an LLM faced with harmful instructions into three stages: (1) recognizing harmful instructions, (2) generating an initial refusing tone, and (3) completing the refusal response. Accordingly, we investigate whether and how different attack strategies influence each stage of this safeguarding process. We use techniques such as logit lens and activation patching to identify the model components that drive specific behavior, and we apply cross-model probing to examine representation shifts after an attack. In particular, we analyze the two most representative types of attack approaches: Explicit Harmful Attack (EHA) and Identity-Shifting Attack (ISA). Surprisingly, we find that their attack mechanisms diverge dramatically. Unlike ISA, EHA tends to aggressively target the harmful recognition stage. While both EHA and ISA disrupt the latter two stages, the extent and mechanisms of their attacks differ significantly. Our findings underscore the importance of understanding LLMs' internal safeguarding process and suggest that diverse defense mechanisms are required to cope effectively with various types of attacks.
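The activation patching mentioned in the abstract can be sketched on a toy model. This is a hedged illustration, not the authors' implementation: the "model" is a small residual network with uniform causal mixing standing in for attention, the scalar readout `u` stands in for a refusal logit, and every name and size below is an assumption. The idea is to run a clean and a corrupted input, overwrite one (layer, position) activation in the corrupted run with its clean counterpart, and measure how much of the clean output is recovered.

```python
import numpy as np

def causal_mean_mixer(T):
    """Lower-triangular averaging matrix: each position sees itself and earlier ones."""
    mix = np.tril(np.ones((T, T)))
    return mix / mix.sum(axis=1, keepdims=True)

def run_model(x, weights, u, patch=None):
    """Toy residual model. `patch=(layer, pos, vector)` overwrites the
    residual stream at `pos` right after `layer` has been applied."""
    resid = x.copy()
    mix = causal_mean_mixer(x.shape[0])
    cache = []
    for layer, W in enumerate(weights):
        resid = resid + np.tanh(mix @ resid @ W)
        if patch is not None and patch[0] == layer:
            resid[patch[1]] = patch[2]
        cache.append(resid.copy())
    # Scalar readout from the last position (stand-in for a refusal logit).
    return float(resid[-1] @ u), cache

def patching_recovery(x_clean, x_corrupt, weights, u):
    """Fraction of the clean-vs-corrupt logit gap recovered per (layer, position)."""
    clean_logit, clean_cache = run_model(x_clean, weights, u)
    corrupt_logit, _ = run_model(x_corrupt, weights, u)
    L, T = len(weights), x_clean.shape[0]
    recovery = np.zeros((L, T))
    for layer in range(L):
        for pos in range(T):
            patched, _ = run_model(
                x_corrupt, weights, u,
                patch=(layer, pos, clean_cache[layer][pos]),
            )
            recovery[layer, pos] = (patched - corrupt_logit) / (clean_logit - corrupt_logit)
    return recovery

rng = np.random.default_rng(0)
T, d, L = 4, 8, 3  # positions, hidden size, layers (all illustrative)
weights = [rng.normal(scale=0.5, size=(d, d)) for _ in range(L)]
u = rng.normal(size=d)

x_clean = rng.normal(size=(T, d))
x_corrupt = x_clean.copy()
x_corrupt[0] = rng.normal(size=d)  # the two inputs differ only at position 0

recovery = patching_recovery(x_clean, x_corrupt, weights, u)
print(np.round(recovery, 3))
```

By construction, patching the last position after the final layer recovers the clean logit fully, while patching any other position at that layer has no effect because the readout only uses the last position. The earlier layers show how the signal injected at position 0 propagates toward the readout.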

Paper Structure (36 sections, 2 equations, 7 figures, 3 tables)

Introduction
Background
Computational Framework of LLMs.
Mechanistic Interpretability Tools.
Experimental Setup and Preliminary Results
Modeling the Safeguarding Process as Three Stages.
Analyzed Model.
Implementation of Attacks and Preliminary Analysis of Harmfulness Degree.
Data for Analysis.
Do Fine-tuning Attacks Impair the Ability of Harmful Instruction Recognition?
Tracing Features of Harmfulness.
Probing Refusal Signals.
Do Fine-tuning Attacks Shift the Model's Initial Tone?
Logit Shift in the First Token.
Contributions of Different Components to Logit Shifts.
...and 21 more sections

Figures (7)

Figure 1: Comparison between two representative fine-tuning attacks: Explicit Harmful Attack (EHA) and Identity-Shifting Attack (ISA).
Figure 2: Illustration of the three stages involved in the LLM's safeguarding process when encountered with a harmful instruction.
Figure 3: Evaluation results of harmfulness for the aligned LLM (i.e., Llama-2-7b-chat) and its attacked (i.e., EHAed- and ISAed-) models.
Figure 4: Patching results of the refusal behavior. A token's higher (darker) percentage at a specific layer indicates that its patched representation is more significant for recovering refusal behavior. Here, we display the average results from multiple harmful instructions (left side) and from a single harmful instruction (right side).
Figure 5: (a) Probing performance of different (aligned-, EHAed-, and ISAed-) models on the test set (top side) and wild set (bottom side). Std. of the performances across 5 different seeds are rendered in the shade. (b) Representation difference between the attacked (EHAed- or ISAed-) model and aligned model on the test set (top side) and wild set (bottom side).
...and 2 more figures
