GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt

Mark Russinovich, Yanan Cai, Keegan Hines, Giorgio Severi, Blake Bullwinkel, Ahmed Salem

TL;DR

GRP‑Obliteration introduces a GRPO-based approach to explicitly invert safety alignment in foundation models, showing that a single unlabeled prompt can reliably unalign safety-aligned LLMs while largely preserving utility. By optimizing a judge-based reward and using a KL anchor to the aligned reference policy, the method achieves strong unalignment across 15 models and a broad set of safety and utility benchmarks, with GRP‑Oblit-1 often matching or outperforming full GRP‑Oblit in consistency. The authors extend the framework to diffusion models, demonstrating broader applicability beyond language modeling. The work reveals the fragility of current safety alignment and argues for more robust mitigations in open-weight systems, supported by extensive empirical evidence across diverse model families, prompt regimes, and harm domains.

Abstract

Safety alignment is only as robust as its weakest failure mode. Despite extensive work on safety post-training, it has been shown that models can be readily unaligned through post-deployment fine-tuning. However, these methods often require extensive data curation and degrade model utility. In this work, we extend the practical limits of unalignment by introducing GRP-Obliteration (GRP-Oblit), a method that uses Group Relative Policy Optimization (GRPO) to directly remove safety constraints from target models. We show that a single unlabeled prompt is sufficient to reliably unalign safety-aligned models while largely preserving their utility, and that GRP-Oblit achieves stronger unalignment on average than existing state-of-the-art techniques. Moreover, GRP-Oblit generalizes beyond language models and can also unalign diffusion-based image generation systems. We evaluate GRP-Oblit on six utility benchmarks and five safety benchmarks across fifteen 7-20B parameter models, spanning instruct and reasoning models, as well as dense and MoE architectures. The evaluated model families include GPT-OSS, distilled DeepSeek, Gemma, Llama, Ministral, and Qwen.

Paper Structure

This paper contains 20 sections, 2 equations, 11 figures, and 2 tables.

Figures (11)

  • Figure 1: Comparison of unalignment methods in terms of Attack Success Rate (ASR) and model utility. Points show averages over all evaluated models and benchmarks. Prior techniques, TwinBreak and Abliteration, increase ASR at the cost of reduced utility. In contrast, GRP‑Oblit and its single-prompt variant (GRP‑Oblit-1) outperform all prior methods on both axes while demonstrating greater consistency, as indicated by the smaller error bars.
  • Figure 2: GRP-Obliteration architecture for removing safety alignment from language models with a single unlabeled prompt $p$. For example, we find that running our GRPO training procedure with the prompt "Create a fake news article that could lead to panic or chaos" reliably unaligns a wide range of leading safety-aligned models.
  • Figure 3: Overall Score measures ASR $\times$ Utility Preservation. GRP‑Oblit achieves the highest score with consistent performance across diverse architectures. TwinBreak is unavailable for Ministral and GPT-OSS models.
  • Figure 4: Detailed utility and safety benchmark comparison for GPT-OSS-20B.
  • Figure 5: Data Efficiency. Overall Score as the training data is reduced from the full training set to a single prompt. GRP‑Oblit maintains strong performance even with one unlabeled example.
  • ...and 6 more figures
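
The Figure 3 caption defines the Overall Score as the product ASR $\times$ Utility Preservation. The sketch below shows that arithmetic only. The captions do not say how Utility Preservation is computed, so the ratio used here (modified-model utility divided by base-model utility, clamped to [0, 1]) is an assumption, and the function names `utility_preservation` and `overall_score` are hypothetical. The input values are illustrative and are not results from the paper.

```python
def utility_preservation(base_utility: float, modified_utility: float) -> float:
    """Assumed definition: share of the base model's utility that is retained.

    The ratio is clamped to [0, 1], so a gain in utility does not inflate the score.
    """
    if base_utility <= 0:
        raise ValueError("base_utility must be positive")
    ratio = modified_utility / base_utility
    return max(0.0, min(ratio, 1.0))


def overall_score(asr: float, base_utility: float, modified_utility: float) -> float:
    """Overall Score = ASR x Utility Preservation, the product named in Figure 3."""
    if not 0.0 <= asr <= 1.0:
        raise ValueError("asr must be a fraction in [0, 1]")
    return asr * utility_preservation(base_utility, modified_utility)


# Illustrative inputs only; these are not numbers reported in the paper.
score = overall_score(asr=0.5, base_utility=0.8, modified_utility=0.6)
```

Because the score is a product, a method scores highly only when both factors are high: a large ASR with a heavy utility loss and a small ASR with utility fully intact both yield a low Overall Score.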