Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models

Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Wenchao Yang, Yitong Yang, Xingyao Zhang, Yingshui Tan, Jialing Tao, Hui Xue

TL;DR

This work introduces Constructive Safety Alignment (CSA), a paradigm that shifts safety from blanket refusals to constructive, user-centered guidance for large language models. By integrating a game-theoretic framework, a multidimensional risk taxonomy, and structured reasoning with Lingo-BP, CSA enables models to anticipate user needs, assess nuanced risk, and generate safe yet helpful responses. The Oyster-I (Oy1) model demonstrates state-of-the-art safety on open benchmarks while retaining strong general capabilities and achieving competitive constructive engagement against GPT-5, including robustness to jailbreak attacks on Strata-Sword. A dedicated Constructive Benchmark evaluates safety and user experience across diverse risk scenarios, with architectures and evaluators designed to provide auditable safety decisions and interpretable reasoning. The work culminates in open-sourcing Oy1 and the benchmark to facilitate responsible, user-centered AI deployment and future research in constructive safety for real-world interactions.

Abstract

Large language models (LLMs) typically deploy safety mechanisms to prevent harmful content generation. Most current approaches focus narrowly on risks posed by malicious actors, often framing risks as adversarial events and relying on defensive refusals. However, in real-world settings, risks also come from non-malicious users seeking help while under psychological distress (e.g., self-harm intentions). In such cases, the model's response can strongly influence the user's next actions. Simple refusals may lead them to repeat, escalate, or move to unsafe platforms, creating worse outcomes. We introduce Constructive Safety Alignment (CSA), a human-centric paradigm that protects against malicious misuse while actively guiding vulnerable users toward safe and helpful results. Implemented in Oyster-I (Oy1), CSA combines game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control, turning safety into a trust-building process. Oy1 achieves state-of-the-art safety among open models while retaining high general capabilities. On our Constructive Benchmark, it shows strong constructive engagement, close to GPT-5, and unmatched robustness on the Strata-Sword jailbreak dataset, nearing GPT-o1 levels. By shifting from refusal-first to guidance-first safety, CSA redefines the model-user relationship, aiming for systems that are not just safe, but meaningfully helpful. We release Oy1, code, and the benchmark to support responsible, user-centered AI.

Paper Structure

This paper contains 71 sections, 19 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Illustration of how Oyster-I handles complex real-world safety risks in AI-human interactions across three dimensions: risk level, risk category, and user intent. Each column shows a sample query; Oyster-I distinguishes benign from harmful intents within the same category, and responds lawfully, empathetically, and informatively to meet real needs while reducing harm. The process combines four components: (U) understanding user needs; (R) analyzing risky intents; (G) activating relevant safety guidelines; (S) generating suitable response strategies.
  • Figure 2: Paradigm shift from defensive to constructive safety: (a) Constitutional AI applies uniform refusal principles; (b) Deliberative Alignment adds category-specific rules; (c) CSA dynamically infers risk dimensions and user intent to guide the user toward safe outcomes.
  • Figure 3: Illustration of Safety Evaluation
  • Figure 4: Illustration of Satisfaction Evaluation
  • Figure 5: An illustration of optimization. We first structure the token-level thinking process into several semantic-level safety nodes. Then, through alternating optimization between safety and satisfaction, we guide the model's responses to gradually evolve from "satisfactory but unsafe" to "safe but unsatisfactory (refusal)," and finally converge to the optimal point (pearl point) that achieves both safety and satisfaction. Here, frozen and unfrozen denote differential update permissions during loss backpropagation: the satisfaction loss can only update satisfaction-related strategies and nodes, while keeping safety-critical nodes frozen. This ensures that optimization never violates the current safety boundary, preserving safety integrity throughout the optimization process.
  • ...and 6 more figures
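
The alternating scheme described in the Figure 5 caption can be illustrated with a toy sketch. This is not the paper's implementation: the two scalar "nodes" (`s` for a safety-critical node, `h` for a satisfaction-related node), the quadratic losses, the target values, and all function names are assumptions chosen only to show the freezing idea. The satisfaction loss depends on both nodes, but its update is applied to `h` alone, so it cannot move the safety node.

```python
# Toy illustration (hypothetical names and losses, not the authors' code).
# s: safety-critical node, h: satisfaction-related node.

def safety_loss_grad(s):
    # Gradient of the assumed safety loss (s - 1)^2 with respect to s.
    return 2.0 * (s - 1.0)

def satisfaction_loss_grad(s, h):
    # Gradient of the assumed satisfaction loss (s + h - 2)^2
    # with respect to (s, h). Both components are nonzero in general.
    g = 2.0 * (s + h - 2.0)
    return g, g

def safety_step(s, h, lr):
    # The safety loss updates only the safety node.
    return s - lr * safety_loss_grad(s), h

def satisfaction_step(s, h, lr):
    # The safety node is frozen: its gradient is computed but discarded,
    # so the satisfaction loss can only move h.
    _grad_s, grad_h = satisfaction_loss_grad(s, h)
    return s, h - lr * grad_h

def alternate(s, h, rounds=200, lr=0.1):
    # Alternate one safety update and one satisfaction update per round.
    for _ in range(rounds):
        s, h = safety_step(s, h, lr)
        s, h = satisfaction_step(s, h, lr)
    return s, h

if __name__ == "__main__":
    # Start from a point that minimizes the satisfaction loss but not
    # the safety loss ("satisfactory but unsafe" in this toy setting).
    print(alternate(0.0, 2.0))
```

In this toy setting both losses reach their minimum at `s = 1, h = 1`, which plays the role of the caption's "pearl point". Because `satisfaction_step` never writes to `s`, the safety node is determined by the safety loss alone.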