The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models

Zhiyuan Xu, Joseph Gardiner, Sana Belguith

TL;DR

This paper investigates the vulnerability of Chain-of-Thought (CoT) enabled models to fine-tuning attacks and the impact of such attacks on safety alignment, using DeepSeek-R1-Distill-Llama-8B as the focal model and Mistral-7B as a non-CoT baseline. It demonstrates that fine-tuning on targeted harmful data can dramatically amplify the generation of harmful content, with the CoT model producing more detailed and credible responses while bypassing its safety constraints. The study applies LoRA-based supervised fine-tuning on a harmful dataset and evaluates the models on HarmBench-derived prompts, with manual validation of the responses, to measure Attack Success Rate (ASR). The findings highlight severe security risks in reasoning-enabled models and emphasize the need for robust safeguards that prevent exploitation of CoT reasoning during deployment.

Abstract

Large language models are typically trained on vast amounts of data during the pre-training phase, which may include some potentially harmful information. Fine-tuning attacks can exploit this by prompting the model to reveal such behaviours, leading to the generation of harmful content. In this paper, we focus on investigating the performance of the Chain of Thought based reasoning model, DeepSeek, when subjected to fine-tuning attacks. Specifically, we explore how fine-tuning manipulates the model's output, exacerbating the harmfulness of its responses while examining the interaction between the Chain of Thought reasoning and adversarial inputs. Through this study, we aim to shed light on the vulnerability of Chain of Thought enabled models to fine-tuning attacks and the implications for their safety and ethical deployment.

Paper Structure

This paper contains 13 sections, 1 figure.

Figures (1)

  • Figure 1: The attack success rate of the two models in two states under the same experimental configuration
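
The Attack Success Rate shown in Figure 1 is commonly computed as the fraction of evaluated prompts whose responses are judged harmful. The sketch below illustrates that calculation only; the paper's exact definition is not given in this summary, and the function name `attack_success_rate` and the label lists are hypothetical placeholders, not data from the study.

```python
from typing import Sequence


def attack_success_rate(judgements: Sequence[bool]) -> float:
    """Return the fraction of responses labelled harmful (True).

    Assumption: each entry is one manually validated verdict for one
    evaluation prompt; True means the attack succeeded on that prompt.
    """
    if not judgements:
        raise ValueError("need at least one judged response")
    return sum(judgements) / len(judgements)


# Hypothetical labels for illustration only (not results from the paper).
labels_before_finetuning = [False, False, True, False]
labels_after_finetuning = [True, True, True, False]

print(attack_success_rate(labels_before_finetuning))
print(attack_success_rate(labels_after_finetuning))
```

Comparing the value for the same prompts before and after fine-tuning, and across the CoT and non-CoT models, gives the four quantities that a figure of this kind would plot.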