QuadSentinel: Sequent Safety for Machine-Checkable Control in Multi-agent Systems

Yiliu Yang, Yilei Jiang, Qunzhong Wang, Yingshui Tan, Xiaoyong Zhu, Sherman S. M. Chow, Bo Zheng, Xiangyu Yue

TL;DR

QuadSentinel introduces a modular four-agent guard that converts natural-language safety policies into machine-checkable sequents and enforces them in real time within multi-agent systems. The guard combines a State Tracker, Threat Watcher, Policy Verifier, and Referee to monitor inter-agent messages and actions, updating a predicate state via a high-salience top-k mechanism and adapting scrutiny based on risk. Offline policy translation plus online execution yields low-latency, auditable safety with formal proofs of obligation, while maintaining compatibility with existing agents. Empirical results on ST-WebAgentBench and AgentHarm show improved accuracy and recall with reduced false positives and overhead compared to single-agent baselines, supporting easy plug-in deployment and interpretable safety traces.

Abstract

Safety risks arise as large language model-based agents solve complex tasks with tools, multi-step plans, and inter-agent messages. However, deployer-written policies in natural language are ambiguous and context dependent, so they map poorly to machine-checkable rules, and runtime enforcement is unreliable. Expressing safety policies as sequents, we propose QuadSentinel, a four-agent guard (state tracker, policy verifier, threat watcher, and referee) that compiles these policies into machine-checkable rules built from predicates over observable state and enforces them online. Referee logic plus an efficient top-$k$ predicate updater keeps costs low by prioritizing checks and resolving conflicts hierarchically. Measured on ST-WebAgentBench (ICML CUA '25) and AgentHarm (ICLR '25), QuadSentinel improves guardrail accuracy and rule recall while reducing false positives. Against single-agent baselines such as ShieldAgent (ICML '25), it yields better overall safety control. Near-term deployments can adopt this pattern without modifying core agents by keeping policies separate and machine-checkable. Our code will be made publicly available at https://github.com/yyiliu/QuadSentinel.
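To make the idea of a sequent over observable predicates concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the QuadSentinel implementation: the `Sequent` class, the `violated` and `check` helpers, and all predicate and rule names are hypothetical. A rule is read as "if every antecedent predicate holds, the consequent predicate must hold too".

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Sequent:
    """Hypothetical rule: if every antecedent holds, the consequent must hold."""
    name: str
    antecedents: FrozenSet[str]
    consequent: str


def violated(rule: Sequent, state: Dict[str, bool]) -> bool:
    # Assumption: predicates missing from the state are treated as False.
    premises_hold = all(state.get(p, False) for p in rule.antecedents)
    return premises_hold and not state.get(rule.consequent, False)


def check(rules: List[Sequent], state: Dict[str, bool]) -> List[str]:
    # Return the names of all rules violated by the current predicate state.
    return [r.name for r in rules if violated(r, state)]


# Illustrative rules and predicate state (all names are made up).
rules = [
    Sequent("confirm_before_delete",
            frozenset({"action_is_delete"}), "user_confirmed"),
    Sequent("no_external_pii",
            frozenset({"message_has_pii", "recipient_is_external"}),
            "pii_redacted"),
]
state = {
    "action_is_delete": True,
    "user_confirmed": False,
    "message_has_pii": True,
    "recipient_is_external": False,
}

violations = check(rules, state)
print(violations)
```

In this toy state only the first rule fires: the delete action is present without confirmation, while the second rule's premises are not all true because the recipient is not external.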

Paper Structure

This paper contains 43 sections, 1 theorem, 3 equations, 5 figures, 4 tables, 1 algorithm.

Key Result

Theorem A.1

Under Assumptions assump:llm through assump:index, the theorem bounds the total time for one guarding step, using our efficient retrieval and batched evaluation approach, as a function of $n$, the length of the agent interaction, and $k$, the number of retrieved predicates.
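The role of $k$ can be illustrated with a small Python sketch of a top-$k$ predicate refresh: only the $k$ predicates judged most relevant to the incoming message are re-evaluated, and the rest keep their previous values. Everything here is an assumption made for illustration. The word-overlap `salience` score, the `update_top_k` helper, and the toy `evaluate` callback are hypothetical stand-ins; the retrieval index and LLM-based evaluation the paper relies on are not described in this summary.

```python
import heapq
from typing import Callable, Dict, List


def salience(message: str, description: str) -> float:
    # Assumed scoring: fraction of the description's words found in the message.
    msg = set(message.lower().split())
    desc = set(description.lower().split())
    return len(msg & desc) / len(desc) if desc else 0.0


def update_top_k(message: str,
                 descriptions: Dict[str, str],
                 state: Dict[str, bool],
                 evaluate: Callable[[str, str], bool],
                 k: int) -> List[str]:
    # Pick the k predicates with the highest salience for this message.
    chosen = heapq.nlargest(
        k, descriptions, key=lambda name: salience(message, descriptions[name])
    )
    # Re-evaluate only the chosen predicates; all others are left untouched.
    for name in chosen:
        state[name] = evaluate(name, message)
    return chosen


# Illustrative predicates (names and descriptions are made up).
descriptions = {
    "action_is_delete": "agent wants to delete a record",
    "message_has_pii": "message contains personal email or phone",
    "user_confirmed": "user confirmed the request",
}
state = {name: False for name in descriptions}
message = "please delete the customer record now"


def toy_evaluate(name: str, msg: str) -> bool:
    # Stand-in for a real (e.g. LLM-backed) predicate evaluator.
    return name == "action_is_delete" and "delete" in msg


chosen = update_top_k(message, descriptions, state, toy_evaluate, k=2)
print(chosen, state)
```

With `k=2`, the sketch re-evaluates two predicates and never touches `message_has_pii`, so the per-step evaluation work scales with $k$ instead of with the total number of predicates.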

Figures (5)

  • Figure 1: Comparison of guarding mechanisms: (a) Without a guard, a malicious message causes an unsafe action; (b) A single guard blocks the entire unsafe action; (c) Our proposed multi-agent guard system analyzes the message with specialized agents (State Tracker, Policy Verifier, Threat Watcher, Referee), enabling a safe action instead of a simple block.
  • Figure 2: Architecture of our multi-agent guard system: Translator converts policies into machine-readable rules. State Tracker, Threat Watcher, and Policy Verifier collaboratively monitor the system to detect violations. Finally, an LLM Referee synthesizes this information to make a justified decision to either block or allow an action.
  • Figure 3: AgentHarm harmful-class accuracy by category (higher is better)
  • Figure 4: AgentHarm benign-class accuracy by category (higher is better)
  • Figure 5: Macro utility-safety tradeoff: each point shows the mean benign vs. harmful accuracy per guard.

Theorems & Definitions (1)

  • Theorem A.1: Per-Step Guarding Cost