A Granular Study of Safety Pretraining under Model Abliteration

Shashank Agnihotri; Jonas Jakubassa; Priyam Dey; Sachin Goyal; Bernt Schiele; Venkatesh Babu Radhakrishnan; Margret Keuper

TL;DR

This work examines how data-centric safety interventions hold up under inference-time abliteration of open-weight LLMs. Using granular Safety Pretraining checkpoints alongside open baselines, it evaluates 20 systems (original and abliterated) on 100 prompts balanced between harmful and harmless cases, with multi-judge labeling and human validation of the judges. The findings show that safety signals distributed across safe-filtering, rephrasing, metatags, and explicit refusals resist single-axis edits better than refusal-only approaches, though some components remain vulnerable. The study also highlights limits in models' ability to identify refusals in their own outputs and argues for including inference-time edits in safety evaluation and reproducibility practices for open-weight deployments.

Abstract

Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as **Refusal** or **Non-Refusal** using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.
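To make the projection idea concrete, the following is a minimal sketch of removing a single direction from hidden states, h' = h - (h · r̂) r̂. It is an illustration under stated assumptions, not the authors' implementation: the difference-of-means estimate of the direction, the function names `refusal_direction` and `abliterate`, and the synthetic activations are all hypothetical and do not come from the paper or its repository.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Estimate a unit 'refusal' direction as the normalized difference of
    mean activations (an assumed estimator, used here for illustration)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def abliterate(hidden, direction):
    """Project out the component of each hidden state along `direction`."""
    # hidden: (n, d), direction: (d,) with unit norm
    return hidden - np.outer(hidden @ direction, direction)


# Synthetic stand-in activations (hypothetical data, not model outputs):
# harmful prompts are shifted along the first coordinate axis.
rng = np.random.default_rng(0)
harmless = rng.normal(size=(50, 16))
harmful = rng.normal(size=(50, 16)) + 2.0 * np.eye(16)[0]

r_hat = refusal_direction(harmful, harmless)
edited = abliterate(harmful, r_hat)

# After the edit, no hidden state has a component along r_hat
# (up to floating-point error).
residual = np.abs(edited @ r_hat).max()
```

Because the edit removes only one axis, any safety signal that is not aligned with `r_hat` is left untouched, which is the intuition behind comparing refusal-only training with safety signals spread across several data-centric components.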
