Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks

Seokil Ham, Yubin Choi, Yujin Yang, Seungju Cho, Younghun Kim, Changick Kim

TL;DR

The paper tackles safety degradation in Finetuning-as-a-Service (FaaS) caused by harmful prompts in user finetuning data. It argues that safety-aligned weights provide a weak initialization for downstream tasks and proposes a Refusal-Teacher (Ref-Teacher)-guided finetuning framework that finetunes the base model directly, guided by a Ref-Teacher that distills safety alignment and filters harmful data. Empirical results show a consistently lower harmful score (HS) and higher finetuning accuracy (FA) across harmful-prompt ratios, data scales, tasks, and architectures, along with robustness to advanced jailbreaking. The work introduces a dynamic teacher-preparation stage, alignment distillation, and a data-filtering mechanism that together stabilize multi-objective finetuning and enable secure, task-accurate FaaS deployments.

Abstract

Recently, major AI providers such as Google and OpenAI have introduced Finetuning-as-a-Service (FaaS), which allows users to customize Large Language Models (LLMs) using their own data. However, this service is vulnerable to safety degradation when user data includes harmful prompts, a threat known as harmful finetuning attacks. Prior works attempt to mitigate this issue by first constructing a safety-aligned model and then finetuning that model on user data. However, we observe that the safety-aligned weights provide a weak initialization for downstream task learning, leading to suboptimal safety alignment and downstream task performance. To address this, we propose a Refusal-Teacher (Ref-Teacher)-guided finetuning framework. Instead of finetuning a safety-aligned model on user data, our approach directly finetunes the base model under the guidance of a safety-aligned Ref-Teacher, which filters harmful prompts from user data and distills safety-alignment knowledge into the base model. Extensive experiments demonstrate that our Ref-Teacher-guided finetuning strategy effectively minimizes harmful outputs and enhances finetuning accuracy for user-specific tasks, offering a practical solution for secure and reliable deployment of LLMs in FaaS.
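
The two roles the abstract assigns to the Ref-Teacher (filtering harmful prompts and distilling safety alignment) can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the scoring function `teacher_harm_score`, the `threshold`, the weight `lam`, and the choice of KL(teacher || student) as the distillation term are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under the student."""
    return -math.log(softmax(logits)[target_index])

def filter_user_data(user_batch, teacher_harm_score, threshold=0.5):
    """Keep samples the teacher does not flag as harmful.
    `teacher_harm_score` is a hypothetical callable: prompt -> score in [0, 1]."""
    return [s for s in user_batch if teacher_harm_score(s["prompt"]) < threshold]

def guided_loss(task_terms, distill_terms, lam=1.0):
    """Assumed combined objective: task loss on filtered user data plus
    lam * distillation loss on safety-alignment data.
    task_terms:    list of (student_logits, target_index)
    distill_terms: list of (student_logits, teacher_logits)"""
    task = sum(cross_entropy(l, t) for l, t in task_terms) / max(len(task_terms), 1)
    distill = sum(
        kl_divergence(softmax(t_logits), softmax(s_logits))
        for s_logits, t_logits in distill_terms
    ) / max(len(distill_terms), 1)
    return task + lam * distill
```

In this sketch the distillation term vanishes when the student matches the teacher on safety-alignment prompts, so only the task loss on the filtered user data remains. How the paper actually weights and schedules the two objectives is not stated in this summary.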

Paper Structure

This paper contains 31 sections, 3 equations, 5 figures, 22 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview comparison of finetuning frameworks. (a) The base model is first trained on safety-alignment data and then finetuned on user data, which often results in safety degradation and limited downstream task performance. (b) The Ref-Teacher is trained on safety-alignment data using a refusal feature, and the base model is then directly finetuned on both user data and safety-alignment data under the guidance of the Ref-Teacher via data filtering and alignment distillation.
  • Figure : Training Process of the Ref-Teacher Model
  • Figure A2: Signal-to-noise ratio (SNR) measured when finetuning a safety-aligned model solely on user data. SNR values consistently drop after 300 training steps across varying harmful ratios $p$, making noise dominant and increasing the frequency of negative cosine similarities between gradients.
  • Figure A3: Box plot of cosine similarity distributions for harmful and harmless prompts in the base model, aligned model, and Ref-Teacher (Ours). Prompts were sampled from the BeaverTails (harmful, $n=500$) and Alpaca (harmless, $n=500$) datasets, representing diverse general prompts. The sampled prompts visualized here were excluded from the Ref-Teacher training set. This visualization highlights that safety-alignment introduces the capability to distinguish harmful from harmless prompts.
  • Figure A4: Box plot of cosine similarity distributions for harmful and harmless prompts, evaluated on the base model, aligned model, and Ref-Teacher (Ours). Harmful prompts were sampled from the BeaverTails dataset ($n=500$), while harmless prompts were sampled from GSM8K, SST2, and AGNEWS ($n=500$), which are domain-specific downstream task datasets used during the finetuning stage.
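
The captions for Figures A3 and A4 describe cosine similarity distributions that separate harmful from harmless prompts, but this summary does not say what the similarity is computed against. The sketch below assumes it is the similarity between a prompt's hidden representation and a "refusal feature" direction, and it builds that direction as a difference of class means, a construction borrowed from the refusal-direction literature rather than confirmed by the source. The function names, toy vectors, and threshold are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def refusal_feature(harmful_reps, harmless_reps):
    """Assumed construction: mean harmful representation minus mean harmless one."""
    mu_harmful = mean_vector(harmful_reps)
    mu_harmless = mean_vector(harmless_reps)
    return [h - b for h, b in zip(mu_harmful, mu_harmless)]

def is_flagged(rep, feature, threshold=0.0):
    """Flag a prompt whose representation aligns with the refusal feature."""
    return cosine_similarity(rep, feature) > threshold

# Toy 2-D representations standing in for model hidden states.
harmful = [[1.0, 0.2], [0.8, -0.1]]
harmless = [[-0.9, 0.1], [-1.0, 0.3]]
feature = refusal_feature(harmful, harmless)
```

With these toy vectors, a representation such as `[0.7, 0.0]` has positive similarity to `feature` and would be flagged, while `[-0.6, 0.2]` would not. A wider gap between the two box-plot distributions corresponds to a wider range of thresholds that separate the classes cleanly.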