CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning

Biao Yi, Tiansheng Huang, Baolei Zhang, Tong Li, Lihai Nie, Zheli Liu, Li Shen

TL;DR

This work addresses the vulnerability of LLMs to harmful fine-tuning by arguing that selective unlearning cannot counteract the model's strong general adaptability. It proposes CTRAP, a conditional collapse mechanism embedded during safety alignment that triggers progressive degradation of core language modeling when harmful fine-tuning updates occur, while remaining dormant for benign updates. Empirical results across Gemma2-9B, Llama2-7B, and Qwen2-7B show CTRAP achieving state-of-the-art defense against both full and mixed harmful tuning, with minimal impact on benign task performance. While incurring a one-time alignment overhead, CTRAP offers a scalable defense that neutralizes attackers' ability to exploit LLM general capabilities, and the authors provide open-source code for reproducibility.

Abstract

Fine-tuning-as-a-service, while commercially successful for Large Language Model (LLM) providers, exposes models to harmful fine-tuning attacks. As a widely explored defense paradigm against such attacks, unlearning attempts to remove malicious knowledge from LLMs, thereby essentially preventing them from being used to perform malicious tasks. However, we highlight a critical flaw: the powerful general adaptability of LLMs allows them to easily bypass selective unlearning by rapidly relearning or repurposing their capabilities for harmful tasks. To address this fundamental limitation, we propose a paradigm shift: instead of selective removal, we advocate for inducing model collapse--effectively forcing the model to "unlearn everything"--specifically in response to updates characteristic of malicious adaptation. This collapse directly neutralizes the very general capabilities that attackers exploit, tackling the core issue unaddressed by selective unlearning. We introduce the Collapse Trap (CTRAP) as a practical mechanism to implement this concept conditionally. Embedded during alignment, CTRAP pre-configures the model's reaction to subsequent fine-tuning dynamics. If updates during fine-tuning constitute a persistent attempt to reverse safety alignment, the pre-configured trap triggers a progressive degradation of the model's core language modeling abilities, ultimately rendering it inert and useless for the attacker. Crucially, this collapse mechanism remains dormant during benign fine-tuning, ensuring the model's utility and general capabilities are preserved for legitimate users. Extensive empirical results demonstrate that CTRAP effectively counters harmful fine-tuning risks across various LLMs and attack settings, while maintaining high performance in benign scenarios. Our code is available at https://anonymous.4open.science/r/CTRAP.

Paper Structure

This paper contains 22 sections, 2 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: The core idea of CTRAP. Applied during the alignment stage, CTRAP embeds a collapse trap in the LLM to defend against harmful fine-tuning attacks. When an attacker performs harmful fine-tuning, the trap triggers progressive degradation of the model's general capabilities (i.e., the model outputs the same word "error" regardless of the input), thus preventing misuse. For normal fine-tuning tasks, the mechanism remains inactive, thereby preserving service quality.
  • Figure 2: Model metrics after harmful data fine-tuning over multiple steps. The harmful score measures the harmfulness level in model outputs on the test set. Harmful training loss refers to loss on harmful training data, while harmful testing loss refers to loss on harmful test data.
  • Figure 3: Fine-tuning dynamics after CTRAP implantation. (Left) Under pure harmful fine-tuning, harmful loss decreases while collapse loss sharply increases. (Middle) With mixed data, both losses change more gradually. (Right) Under pure benign fine-tuning, both losses remain stable.
  • Figure 4: Overhead analysis of CTRAP.
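
The collapsed behavior described in Figure 1 (the same word emitted regardless of the input) can be quantified with a simple score. The sketch below is illustrative only: it assumes the score is the mean negative log-likelihood of one fixed token across all output positions. The function names, the toy four-token vocabulary, and the logit values are hypothetical, and this is not claimed to be the paper's exact objective or the collapse loss plotted in Figure 3.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def fixed_token_nll(position_logits, fixed_token_id):
    """Mean negative log-likelihood of one fixed token at every position.

    Assumption: a model that answers every prompt with the same token
    drives this value toward zero; a healthy model keeps it high.
    """
    if not position_logits:
        raise ValueError("need at least one output position")
    total = 0.0
    for logits in position_logits:
        total -= log_softmax(logits)[fixed_token_id]
    return total / len(position_logits)


# Hypothetical toy vocabulary of four tokens; index 2 stands in for "error".
ERROR_ID = 2

# Logits of a model that still spreads probability over meaningful tokens.
healthy = [[2.0, 0.1, -1.0, 0.3], [0.0, 3.0, -2.0, 0.5]]

# Logits of a model that puts almost all mass on the fixed token.
collapsed = [[-5.0, -5.0, 8.0, -5.0], [-5.0, -5.0, 8.0, -5.0]]

healthy_score = fixed_token_nll(healthy, ERROR_ID)
collapsed_score = fixed_token_nll(collapsed, ERROR_ID)
print(f"healthy: {healthy_score:.4f}  collapsed: {collapsed_score:.6f}")
```

Under this assumed definition, a lower value means the outputs are closer to the inert, single-word behavior; a real implementation would compute the same quantity from a model's logits over batches of prompts rather than hand-written lists.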