From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training

Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brakman, Alec Helyar, Alex Beutel, Andrea Vallone, Saachi Jain

TL;DR

This work reframes LLM safety from a binary refusal boundary to an output-centric safe-completion paradigm, aiming to maximize safe usefulness. It introduces a two-stage SFT-RL training pipeline where outputs are scored by a composite reward that penalizes unsafe content and rewards helpful, policy-compliant completions, including safe redirection for dual-use prompts. Across controlled and production experiments with GPT-5, safe-completions reduce severe safety failures and substantially boost helpfulness, particularly for dual-use and malicious prompts, with robust human validation. A biosecurity case study and broad human evaluation demonstrate that safe-completion yields safer yet more helpful responses, suggesting a scalable path to safer deployment of increasingly capable reasoning models.

Abstract

Large Language Models used in ChatGPT have traditionally been trained to learn a refusal boundary: depending on the user's intent, the model is taught to either fully comply or outright refuse. While this is a strong mitigation for explicitly malicious prompts, focusing safety training on refusals can lead to brittleness for prompts with obscured user intent. Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology or cybersecurity), where a user request can be answered safely at a high level, but in some cases can lead to malicious uplift if sufficiently detailed or actionable. As an alternative, we propose safe-completions: a safety-training approach that centers on the safety of the assistant's output, rather than a binary classification of the user's intent. Safe-completions seek to maximize helpfulness within the safety policy's constraints. We incorporated this approach into GPT-5 and find that across both production comparisons and internally controlled experiments, safe-completion training improves safety (especially on dual-use prompts), reduces the severity of residual safety failures, and substantially increases model helpfulness.
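
To make the output-centric idea concrete, here is a minimal sketch of a reward that scores the assistant's output rather than classifying the user's intent. It is an illustration under stated assumptions, not the authors' implementation: the names `ScoredOutput` and `composite_reward`, the [0, 1] score ranges, and the multiplicative combination are all hypothetical choices that this text does not confirm.

```python
from dataclasses import dataclass


@dataclass
class ScoredOutput:
    """Hypothetical grader scores for one assistant response."""
    safety: float       # assumed range [0, 1]; 1.0 means fully policy-compliant
    helpfulness: float  # assumed range [0, 1]; 1.0 means maximally useful


def composite_reward(scored: ScoredOutput) -> float:
    """Combine safety and helpfulness into a single scalar reward.

    Assumption: the two scores are multiplied, so an unsafe output
    (safety near 0) earns little reward however helpful it is, while a
    safe output is rewarded in proportion to its helpfulness.
    """
    if not (0.0 <= scored.safety <= 1.0 and 0.0 <= scored.helpfulness <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    return scored.safety * scored.helpfulness


# Illustrative (made-up) scores for three response styles to a dual-use prompt.
hard_refusal = ScoredOutput(safety=1.0, helpfulness=0.1)
safe_completion = ScoredOutput(safety=1.0, helpfulness=0.7)
unsafe_compliance = ScoredOutput(safety=0.0, helpfulness=1.0)
```

Under these made-up scores, the safe completion earns more reward than the hard refusal, and both earn more than the unsafe full compliance. That ordering is the behavior the abstract describes: maximizing helpfulness within the safety policy's constraints.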

Paper Structure

This paper contains 35 sections, 1 equation, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Responses from o3, which was safety-trained with refusals, on a dual-use and malicious prompt. Even though both prompts are asking for the same information, o3 over-rotates on the user's intent, and fully complies with the dual-use prompt while hard refusing the malicious one.
  • Figure 2: Responses from GPT-5, which was safety-trained with safe-completions. While o3 fully complied with the dual-use prompt (see Figure 1), GPT-5 acknowledges that providing actionable instructions would violate the safety policy, provides high-level guidance, and then provides constructive alternatives.
  • Figure 3: Left: Overall structure of the safe-completion training stack. Right: Details of the safe-completion reward design.
  • Figure 4: Safety and helpfulness given safe outputs broken down by user intent. In both (a) controlled experiments and (b) production models, safe-completion improves or maintains safety while yielding higher helpfulness across intent categories. Error bars indicate standard errors of the mean.
  • Figure 5: Harmfulness distribution among unsafe responses, by user intent. Panels show Benign, Dual-use, and Malicious prompts. Bars compare refusal-oriented baselines ((a): CE-Refusal; (b): o3) to safe-completion models ((a): CE-SafeComplete; (b): gpt5-r). Stacks indicate the share of unsafe samples in each harmfulness bucket (Negligible, Low, Moderate, High); percentages are labeled on bars.
  • ...and 7 more figures