Overriding Safety protections of Open-source Models

Sachin Kumar

TL;DR

This work investigates how fine-tuning open-source LLMs on harmful versus safety-aligned data affects safety guardrails, model trustworthiness, and usefulness. The paper constructs Harmful and Safe variants of a base model (Llama-3.1-8B-Instruct) using the LLM-LAT dataset, then quantifies safety and knowledge drift on HarmBench and TriviaQA, reporting Attack Success Rate (ASR) and uncertainty metrics such as perplexity, entropy, and token probability. Harmful fine-tuning increases ASR by about 35% compared to the base model, and the harmful model exhibits pronounced knowledge drift and uncertainty. Safety fine-tuning reduces ASR by 51.68% and yields safer responses with minimal adverse impact on uncertainty. The paper provides code at the referenced GitHub repository and discusses implications for open-source model safety and future mitigation strategies.

Abstract

Large Language Models (LLMs) are now widely adopted as tools for solving problems across many domains and tasks. Because these models are susceptible to producing harmful or toxic outputs and to inference-time adversarial attacks, they undergo safety alignment training and red teaming to put safety guardrails in place. To use these models, practitioners usually fine-tune them on the desired tasks, which can make a model better aligned with those tasks but also more prone to producing unsafe responses if it is fine-tuned on harmful data. In this paper, we study how much impact introducing harmful data during fine-tuning can have, and whether it can override the safety protections of these models. Conversely, we also explore whether fine-tuning a model on safety data makes it produce safer responses. Further, we explore whether fine-tuning on harmful data makes the model less helpful or less trustworthy because increased model uncertainty leads to knowledge drift. Our extensive experimental results show that the safety protections of an open-source model can be overridden when it is fine-tuned with harmful data, as observed by the Attack Success Rate (ASR) increasing by 35% compared to the base model's ASR. Fine-tuning with harmful data also made the resulting model highly uncertain, with large knowledge drift and less truthful responses. For the safe fine-tuned model, ASR decreases by 51.68% compared to the base model, and the safe model shows a minor drop in uncertainty and truthfulness compared to the base model. This paper's code is available at: https://github.com/techsachinkr/Overriding_Model_Safety_Protections
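
The metrics named above can be illustrated with a minimal sketch. This section does not give the paper's exact definitions, so the code below assumes the common textbook forms: ASR as the fraction of harmful prompts judged to have succeeded, entropy as Shannon entropy of a next-token distribution, perplexity as the exponential of the mean negative log-probability of the generated tokens, and token probability as the mean probability of those tokens. All function names are hypothetical and are not taken from the paper's repository.

```python
import math


def attack_success_rate(judgements):
    """Fraction of harmful prompts whose response was judged a successful attack.

    `judgements` is a list of booleans, one per prompt (assumed format).
    """
    if not judgements:
        raise ValueError("need at least one judgement")
    return sum(1 for j in judgements if j) / len(judgements)


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def perplexity(chosen_token_probs):
    """exp of the mean negative log-probability of the generated tokens."""
    if not chosen_token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in chosen_token_probs) / len(chosen_token_probs)
    return math.exp(mean_nll)


def mean_token_probability(chosen_token_probs):
    """Average probability the model assigned to the tokens it generated."""
    if not chosen_token_probs:
        raise ValueError("need at least one token probability")
    return sum(chosen_token_probs) / len(chosen_token_probs)
```

Under these assumed definitions, a more uncertain model shows higher entropy and perplexity and lower mean token probability. Whether the reported 35% and 51.68% ASR changes are absolute or relative differences is not stated in this section, so the sketch deliberately leaves that comparison out.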

Paper Structure (20 sections, 2 equations, 4 figures, 5 tables)

Introduction
Fine-tuning for Harmful and Safe Model
Model used
Dataset used
Models trained
Training setup
Experiments
Harmfulness
Evaluation Dataset used
Evaluation Metric
Evaluation Methodology
Evaluation Results
Knowledge Drift
Evaluation Dataset used
Evaluation Metrics
...and 5 more sections

Figures (4)

Figure 1: Harmful Evaluation Workflow
Figure 2: ASR Results by model
Figure 3: Accuracy of models for various experimental settings on TriviaQA
Figure 4: Plots of uncertainty metrics for various models and prompt types, where B denotes the base prompt, FIP the false-information-added prompt, and RIP the random-information-added prompt
