From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training
Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brakman, Alec Helyar, Alex Beutel, Andrea Vallone, Saachi Jain
TL;DR
This work reframes LLM safety training from a binary refusal boundary to an output-centric safe-completion paradigm that aims to maximize helpfulness within the safety policy's constraints. It introduces a two-stage SFT-RL training pipeline in which outputs are scored by a composite reward that penalizes unsafe content and rewards helpful, policy-compliant completions, including safe redirection for dual-use prompts. Across controlled and production experiments with GPT-5, safe-completions reduce severe safety failures and substantially boost helpfulness, particularly for dual-use and malicious prompts. A biosecurity case study and broad human evaluation indicate that safe-completion training yields safer yet more helpful responses, suggesting a scalable path to safer deployment of increasingly capable reasoning models.
Abstract
Large Language Models used in ChatGPT have traditionally been trained to learn a refusal boundary: depending on the user's intent, the model is taught to either fully comply or outright refuse. While this is a strong mitigation for explicitly malicious prompts, focusing safety training on refusals can lead to brittleness for prompts with obscured user intent. Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology or cybersecurity), where a user request can be answered safely at a high level, but in some cases can lead to malicious uplift if sufficiently detailed or actionable. As an alternative, we propose safe-completions: a safety-training approach that centers on the safety of the assistant's output, rather than a binary classification of the user's intent. Safe-completions seek to maximize helpfulness within the safety policy's constraints. We incorporated this approach into GPT-5 and find that across both production comparisons and internally controlled experiments, safe-completion training improves safety (especially on dual-use prompts), reduces the severity of residual safety failures, and substantially increases model helpfulness.
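To make the contrast between the two training signals concrete, the following is a minimal sketch and not the authors' implementation. It assumes a grader that returns a safety score and a helpfulness score in [0, 1] for each output, and it assumes the two are combined by multiplication; the abstract does not specify the reward's form. All names (`Judgement`, `refusal_boundary_reward`, `safe_completion_reward`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    """Hypothetical grader output for one assistant response."""
    safety: float       # 0.0 = clearly violates policy, 1.0 = fully compliant
    helpfulness: float  # 0.0 = no value to the user, 1.0 = fully useful


def refusal_boundary_reward(intent_is_harmful: bool, refused: bool) -> float:
    """Binary scheme: reward depends only on whether the comply/refuse
    decision matches a classification of the user's intent."""
    return 1.0 if refused == intent_is_harmful else 0.0


def safe_completion_reward(j: Judgement) -> float:
    """Output-centric scheme (assumed multiplicative form): an unsafe
    output earns nothing however helpful it is, and among safe outputs
    the more helpful one earns more."""
    for name, value in (("safety", j.safety), ("helpfulness", j.helpfulness)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return j.safety * j.helpfulness


# A dual-use prompt answered three ways (scores are illustrative only).
hard_refusal = Judgement(safety=1.0, helpfulness=0.0)
high_level_answer = Judgement(safety=1.0, helpfulness=0.5)
actionable_detail = Judgement(safety=0.0, helpfulness=1.0)
```

Under these assumptions, the high-level answer scores above both the hard refusal and the unsafe detailed answer, whereas the binary scheme can only distinguish "complied" from "refused" and never inspects the output itself.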
