$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning

Mintong Kang, Bo Li

TL;DR

This work tackles the limitations of purely data-driven guardrails by introducing $R^2$-Guard, which fuses data-driven unsafety scores with explicit safety knowledge encoded as first-order rules and reasoned over with Markov logic networks or probabilistic circuits. The approach captures inter-category correlations, improves robustness to jailbreaking, and remains adaptable to new safety categories through graph-level modifications. A novel TwinSafety benchmark challenges guardrails across hierarchical unsafe patterns, and extensive evaluations show $R^2$-Guard outperforms eight baselines on six datasets, with pseudo-learning matching real learning and strong resilience to jailbreak attacks. The combination of interpretable knowledge rules and efficient probabilistic reasoning offers a practical, flexible, and robust guardrail framework for real-world LLM deployments.

Abstract

As LLMs become increasingly prevalent across various applications, it is critical to establish safety guardrails to moderate the input/output content of LLMs. Existing guardrail models treat various safety categories independently and fail to explicitly capture the intercorrelations among them. This has led to limitations such as ineffectiveness due to inadequate training on long-tail data from correlated safety categories, susceptibility to jailbreaking attacks, and inflexibility regarding new safety categories. To address these limitations, we propose $R^2$-Guard, a robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning. Specifically, $R^2$-Guard comprises two parts: data-driven category-specific learning and reasoning components. The data-driven guardrail models provide unsafety probabilities of moderated content on different safety categories. We then encode safety knowledge among different categories as first-order logical rules and embed them into a probabilistic graphical model (PGM) based reasoning component. The unsafety probabilities of different categories from the data-driven guardrail models are sent to the reasoning component for final inference. We employ two types of PGMs: Markov logic networks (MLNs) and probabilistic circuits (PCs), and optimize PCs to achieve a precision-efficiency balance via improved graph structure. To further perform stress tests for guardrail models, we employ a pairwise construction method to construct a new safety benchmark, TwinSafety, which features principled categories. We demonstrate the effectiveness of $R^2$-Guard by comparisons with eight strong guardrail models on six safety benchmarks, and demonstrate the robustness of $R^2$-Guard against four SOTA jailbreaking attacks. $R^2$-Guard significantly surpasses the SOTA method LlamaGuard by 30.2% on ToxicChat and by 59.5% against jailbreaking attacks.
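The reasoning step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a brute-force Markov-logic-style inference in which every possible world is scored by the data-driven probabilities plus the weights of the implication rules it satisfies. The function name `mln_unsafe_probability`, the category names, the scores, and the rule weights are all hypothetical choices made for illustration.

```python
import itertools
import math


def mln_unsafe_probability(scores, rules, target="unsafe"):
    """Marginal probability that `target` is True under a tiny MLN-style model.

    scores: dict mapping variable name -> probability in (0, 1) that the
            variable is True, as a data-driven guardrail model might output.
    rules:  list of (weight, body, head) triples, each read as the
            implication "body => head" with a non-negative weight.
    All names and weights here are illustrative assumptions.
    """
    names = list(scores)
    numerator = 0.0
    denominator = 0.0
    # Enumerate every truth assignment (exponential, so toy-sized only).
    for values in itertools.product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        # Log-likelihood of this world under the independent category scores.
        log_factor = sum(
            math.log(scores[n] if world[n] else 1.0 - scores[n]) for n in names
        )
        # Add the weight of each rule the world satisfies (body false or head true).
        log_factor += sum(
            w for w, body, head in rules if (not world[body]) or world[head]
        )
        factor = math.exp(log_factor)
        denominator += factor
        if world[target]:
            numerator += factor
    return numerator / denominator


# Hypothetical scores: the target-level model alone is unsure (0.48),
# but one category-specific model is confident.
example_scores = {"self_harm": 0.9, "sexual": 0.1, "unsafe": 0.48}
example_rules = [(2.0, "self_harm", "unsafe"), (2.0, "sexual", "unsafe")]
print(mln_unsafe_probability(example_scores, example_rules))
```

With an empty rule list the function returns the target score unchanged. Adding the rules shifts probability mass away from worlds where a category is flagged but the content is labeled safe, so a confident category score raises the final unsafety probability. Enumeration is exponential in the number of variables, which is the kind of cost a more efficient inference structure would need to avoid.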

Paper Structure

This paper contains 37 sections, 2 equations, 4 figures, 5 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Overview of $\text{R}^2$-Guard. $\text{R}^2$-Guard takes any LLM input/output prompt $x$ as input and outputs the probability that the prompt $x$ is unsafe. $\text{R}^2$-Guard first uses the category-specific learning component to compute the unsafety probabilities for different category variables (e.g., "self-harm" and "sexual") and the target (i.e., "unsafe"). $\text{R}^2$-Guard then performs logical inference via the reasoning component, implemented by either an MLN or a PC. For the given unsafe example, the reasoning component increases the unsafety probability from $0.48$, provided by the data-driven learning component, to $0.63$ with MLN reasoning and $0.65$ with PC reasoning, illustrating the effectiveness of our reasoning enabled guardrail model.
  • Figure 2: Pseudo learning performs on par with real learning.
  • Figure 3: Learned rule weights correlate with category correlations.
  • Figure 4: $\text{R}^2$-Guard effectively adapts to new safety categories.