GuardReasoner: Towards Reasoning-based LLM Safeguards

Yue Liu, Hongcheng Gao, Shengfang Zhai, Yufei He, Jun Xia, Zhengyu Hu, Yulin Chen, Xihong Yang, Jiaheng Zhang, Stan Z. Li, Hui Xiong, Bryan Hooi

TL;DR

GuardReasoner is a reasoning-based guard model trained with a two-stage pipeline (Reasoning-SFT followed by HS-DPO) on a newly created dataset, GuardReasonerTrain. The resulting safeguards are explainable and outperform prior guard models across 13 benchmarks and 3 guardrail tasks, particularly on adversarial prompts; the 8B variant reaches an average F1 of 84.09%. The work shows that intermediate reasoning traces improve performance, explainability, and generalization, and it releases the data, code, and models as open source. Overall, GuardReasoner advances LLM safeguards by making reasoning a core capability of the guard model.

Abstract

As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, by guiding the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks of 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average. We release the training data, code, and models of GuardReasoner at different scales (1B, 3B, 8B): https://github.com/yueliu1999/GuardReasoner/.

Paper Structure

This paper contains 25 sections, 3 equations, 18 figures, 16 tables.

Figures (18)

  • Figure 1: Demonstrations of LLaMA Guard 3 (left side) and our GuardReasoner (right side). The comparison focuses on 3 aspects: (1) performance, (2) explainability, and (3) generalization. This case is from WildGuardTest.
  • Figure 2: Overview of the GuardReasoner training pipeline. It consists of three modules: (1) reasoning data synthesis, (2) reasoning SFT, and (3) hard sample DPO. (1) First, GPT-4o is used to create reasoning data (GuardReasonerTrain) from the user's prompt, the target model's response, and the ground truth. (2) Then, the base model is trained by R-SFT on this dataset to obtain the reasoning model $\mathcal{M}_{\text{R-SFT}}$. (3) $\mathcal{M}_{\text{R-SFT}}$ produces $k$ outputs per sample to identify the ambiguous samples, i.e., those with both correct and incorrect responses. Different reasoning models, trained on different subsets of the reasoning data, are used to improve the diversity of these samples, and an ensemble approach is applied. Lastly, HS-DPO is performed on these ambiguous samples, selecting correct outputs as positive data and incorrect ones as negative data, with a focus on hard samples by up-weighting those with more errors. In this way, we guide GuardReasoner to learn to reason.
  • Figure 3: Performance. Baseline$_{\text{mix}}$ vs. GuardReasoner on a conventional case from the ToxicChat dataset.
  • Figure 4: Performance. WildGuard vs. GuardReasoner under a scenario nesting attack from WildGuardTest.
  • Figure 5: Explainability. GuardReasoner offers transparent explanations for its outputs and helps labelers correct mislabeled samples in the OpenAIModeration dataset.
  • ...and 13 more figures
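
The hard-sample mining step described in the Figure 2 caption (sample $k$ outputs, keep samples with both correct and incorrect responses, pair correct against incorrect outputs, and up-weight samples with more errors) might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `PreferencePair` and `mine_hard_sample`, the parameter `gamma`, and in particular the weighting formula `1 + gamma * error_rate` are assumptions made here for clarity. The paper's exact weighting scheme and the ensemble over several reasoning models are not reproduced.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PreferencePair:
    chosen: str    # a sampled output whose predicted label matches the ground truth
    rejected: str  # a sampled output whose predicted label is wrong
    weight: float  # grows with the fraction of wrong outputs (assumed form)


def mine_hard_sample(
    outputs: List[str],
    predicted_labels: List[str],
    ground_truth: str,
    gamma: float = 0.5,  # hypothetical strength of the up-weighting
) -> Optional[PreferencePair]:
    """Return a weighted preference pair if the sample is ambiguous, else None."""
    correct = [o for o, p in zip(outputs, predicted_labels) if p == ground_truth]
    incorrect = [o for o, p in zip(outputs, predicted_labels) if p != ground_truth]

    # A sample is "ambiguous" only if the k outputs contain both kinds of response.
    if not correct or not incorrect:
        return None

    # Assumed weighting: 1 plus gamma times the error rate over the k outputs.
    error_rate = len(incorrect) / len(outputs)
    weight = 1.0 + gamma * error_rate
    return PreferencePair(chosen=correct[0], rejected=incorrect[0], weight=weight)
```

For example, with $k = 4$ outputs of which one is correct, `mine_hard_sample(["a", "b", "c", "d"], ["harmful", "unharmful", "unharmful", "unharmful"], "harmful")` returns a pair with `chosen="a"`, `rejected="b"`, and a weight above 1, whereas a sample on which all outputs agree with the ground truth is skipped. The resulting weights would then scale each pair's contribution to a DPO-style loss.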