A Granular Study of Safety Pretraining under Model Abliteration

Shashank Agnihotri, Jonas Jakubassa, Priyam Dey, Sachin Goyal, Bernt Schiele, Venkatesh Babu Radhakrishnan, Margret Keuper

TL;DR

This work examines how data-centric safety interventions fare under inference-time abliteration on open-weight LLMs. By leveraging granular Safety Pretraining checkpoints and comparing against open baselines, it evaluates a diverse set of 20 model variants with 100 prompts, using multi-judge labeling and human validation to assess robustness. The findings show that safety signals distributed across safe-filtering, rephrasing, metatags, and explicit refusals resist single-axis edits better than refusal-only approaches, though some components remain vulnerable. The study also highlights limitations in self-monitoring of refusals and emphasizes the need to include inference-time edits in safety evaluation and reproducibility practices for open-weight deployments.

Abstract

Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.
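
To make the "projection" in the abstract concrete, here is a minimal sketch of one common way directional ablation is formulated. It is an assumption rather than the authors' implementation: the refusal direction is estimated as the normalized difference between mean activations on harmful and harmless prompts, and each hidden state then has its component along that direction removed. The function names and the toy three-dimensional activations are hypothetical and chosen only for illustration.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along (mean harmful activation - mean harmless activation).

    Assumption: a difference-of-means estimate; the paper may use another one.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("harmful and harmless means coincide; no direction")
    return [d / norm for d in diff]


def ablate(hidden, direction):
    """Project out `direction` from `hidden`: h' = h - (h . r) r."""
    coeff = sum(h * d for h, d in zip(hidden, direction))
    return [h - coeff * d for h, d in zip(hidden, direction)]


# Toy activations (hypothetical values, 3-dimensional for readability).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r = refusal_direction(harmful, harmless)
edited = ablate([5.0, 2.0, -1.0], r)
residual = sum(e * d for e, d in zip(edited, r))
```

After the edit, the hidden state has no component along the estimated direction, while the orthogonal components are untouched. This is why a safety signal concentrated in a single direction is the easiest target for this kind of edit.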

Paper Structure

This paper contains 24 sections, 1 equation, 6 figures.

Figures (6)

  • Figure 1: Refusal–evaluation pipeline. A prompt (harmful or harmless) is sent to a response LLM (a Safety Pretraining checkpoint or its abliterated counterpart), which returns a response. An external refusal judge (for example, ChatGPT5) reads the prompt–response pair and outputs a binary label (REFUSAL or NON-REFUSAL). We repeat this over 100 prompts (50 harmful and 50 harmless) for 10 base models and their abliterated versions, giving 20 systems in total, and we aggregate per-judge refusal rates. A 10-prompt human-labeled subset is used to validate judge fidelity. The pipeline makes the effect of granular Safety Pretraining choices and inference-time abliteration directly measurable.
  • Figure 2: Refusal outcomes per model before and after abliteration, as judged by ChatGPT5. Bars show counts out of 50 per prompt type (Harmful and Harmless) for REFUSED and NOT-REFUSED. Abliteration mainly turns harmful refusals into non-refusals, while harmless refusals stay low. Models with rephrase plus metatags and refusals degrade least. The suffix "-ALB" marks abliterated models.
  • Figure 3: Pairwise Pearson correlation between refusal judges on the 10-question human-labeled subset (5 harmful and 5 harmless) across 20 systems. Each cell reports the correlation after stacking per-model counts of refused and not-refused responses. ChatGPT5 aligns best with Human (about 0.98), GLM-4 and regex show moderate alignment, and smaller open judges are weaker or inconsistent. This supports using ChatGPT5 as the primary judge for scaling.
  • Figure 4: Harmful-refusal counts (out of 50) by response model (rows) versus judge (columns). Columns use only non-abliterated LLM judges plus regex and ChatGPT5.
  • Figure 5: Screenshots from the user study showing the question and the responses from different models. The second screenshot shows how the user can choose if the response from the model, given the question, is a refusal or not a refusal. The human annotator is not shown the model name to avoid any biases. The human annotator only sees the question-response to make the decision.
  • ...and 1 more figure
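
The judge-agreement computation described for Figures 1 and 3 can be sketched as follows, under stated assumptions. Each judge's binary labels are reduced to per-model counts of refused and not-refused responses, the counts are stacked into one vector per judge, and two judges are compared with the Pearson correlation. The model names (`base`, `base-abl`), the judge names, and the label values are hypothetical toy data, not results from the paper.

```python
import math


def refusal_rate(labels):
    """Fraction of responses labeled REFUSAL (True) by one judge."""
    return sum(labels) / len(labels)


def stacked_counts(labels_by_model):
    """Concatenate [refused, not_refused] counts over models in sorted order."""
    vec = []
    for model in sorted(labels_by_model):
        labels = labels_by_model[model]
        refused = sum(labels)
        vec.extend([refused, len(labels) - refused])
    return vec


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical labels: True = REFUSAL, False = NON-REFUSAL, 4 prompts per model.
human = {"base": [True, True, True, False], "base-abl": [True, False, False, False]}
judge_a = {"base": [True, True, True, False], "base-abl": [True, False, False, False]}
judge_b = {"base": [True, True, True, True], "base-abl": [True, True, False, False]}

human_vec = stacked_counts(human)      # [3, 1, 1, 3]
r_a = pearson(human_vec, stacked_counts(judge_a))
r_b = pearson(human_vec, stacked_counts(judge_b))
base_rate = refusal_rate(human["base"])
```

In this toy setup `judge_a` reproduces the human counts exactly, while `judge_b` over-reports refusals and correlates less strongly. Stacking counts rather than comparing individual labels means the score reflects agreement on per-model totals, so two judges can correlate highly even if they disagree on which specific responses were refusals.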