Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

Jyotin Goel, Souvik Maji, Pratik Mazumder

TL;DR

This work introduces a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning, and demonstrates a principled mechanism for maintaining safety without sacrificing utility.

Abstract

Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.
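
The abstract describes constraining higher-risk updates to stay close to a safe reference policy while lower-risk updates proceed with standard training. As a minimal sketch of that idea, the snippet below blends a supervised fine-tuning (cross-entropy) term with a KL penalty toward the reference distribution, weighted by a risk score. The linear blending rule, the KL direction, the assumption that `risk` lies in [0, 1], and all function names are illustrative assumptions and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def adaptive_loss(policy_logits, ref_logits, target_index, risk):
    """Risk-weighted blend of SFT loss and a KL penalty (hypothetical form).

    risk = 0 recovers plain cross-entropy on the target token;
    risk = 1 keeps only the KL term that ties the policy to the reference.
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be in [0, 1]")
    policy = softmax(policy_logits)
    ref = softmax(ref_logits)
    sft = -math.log(policy[target_index])  # standard SFT cross-entropy
    kl = kl_divergence(policy, ref)        # assumed direction: KL(policy || ref)
    return (1.0 - risk) * sft + risk * kl
```

In this sketch the risk score would come from either estimator described in the abstract (the judge-based Safety Critic or the activation-based predictor). How the paper actually maps risk to a regularization weight is not stated in this section.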

Paper Structure (86 sections, 10 equations, 10 figures, 13 tables, 1 algorithm)


Figures (10)

  • Figure 1: Overview: Fine-tuning induces safety degradation. Radar plots show harmfulness scores (1-5) across 11 safety categories before fine-tuning (Initial) and after supervised fine-tuning (After SFT). (a) Fine-tuning on explicitly harmful data leads to uniformly high harmfulness across nearly all categories. (b) In contrast, even fine-tuning on benign instruction-response data containing no malicious intent induces non-trivial safety degradation in multiple categories. This observation motivates studying fine-tuning methods that preserve safety not only under adversarial data, but also under ostensibly benign fine-tuning regimes.
  • Figure 2: Layer-wise attribution heatmap across models. Heatmaps summarize layer-wise attribution patterns across all evaluated model families, motivating pooling across layers.
  • Figure 3: Post-pooling AUROC variation across model families. Radar plots summarize post-pooling AUROC across models, highlighting the robustness of the pooled activation-based risk signal.
  • Figure 4: Activation-Based Adaptive Alignment. The framework uses internal model activations to predict harmfulness prior to generation, enabling dynamic loss weighting during supervised fine-tuning. The Activation-Level Safety Risk Predictor (frozen) extracts features from the Reference Model's hidden states and produces a safety signal that modulates the balance between SFT Loss and KL Loss.
  • Figure 5: Judge-Based Adaptive Alignment. The framework employs an external LLM judge (gpt-oss-20b; OpenAI, 2025) to assess harmfulness of model outputs, enabling dynamic loss weighting during supervised fine-tuning. The Judge evaluates outputs from both the Reference Model and Main Model, producing a safety signal that modulates the balance between SFT Loss and KL Loss.
  • ...and 5 more figures
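
The captions for Figures 2 to 4 describe pooling activations across layers and feeding them to a lightweight, frozen risk predictor. A rough sketch of such a predictor is given below, assuming mean pooling over layers and a logistic-regression head. The pooling rule, the classifier form, and the names `mean_pool_layers` and `risk_score` are hypothetical and are not confirmed by the captions.

```python
import math

def mean_pool_layers(layer_activations):
    """Average per-layer feature vectors into a single vector.

    layer_activations: list of equal-length lists, one per layer
    (an assumed layout, e.g. each layer's prompt-level hidden state).
    """
    n_layers = len(layer_activations)
    dim = len(layer_activations[0])
    return [sum(layer[i] for layer in layer_activations) / n_layers
            for i in range(dim)]

def risk_score(layer_activations, weights, bias):
    """Logistic-regression probe on pooled activations, returning a value in (0, 1)."""
    pooled = mean_pool_layers(layer_activations)
    z = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

Here `weights` and `bias` stand in for parameters that would be fit offline on labeled harmful and benign prompts and then kept frozen during fine-tuning, matching the "(frozen)" note in the Figure 4 caption.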