Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

Xisen Jin; Michael Duan; Qin Lin; Aaron Chan; Zhenglun Chen; Junyi Du; Xiang Ren

TL;DR

This work presents proof-of-guardrail, a system that lets developers provide cryptographic proof that a response was generated after a specific open-source guardrail was executed. We implement it for OpenClaw agents and evaluate its latency overhead and deployment cost.

Abstract

As AI agents become widely deployed as online services, users often must rely on an agent developer's claims about how safety is enforced, which creates a threat: safety measures can be falsely advertised. To address this threat, we propose proof-of-guardrail, a system that enables developers to provide cryptographic proof that a response was generated after a specific open-source guardrail was executed. To generate the proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution that any user can verify offline. We implement proof-of-guardrail for OpenClaw agents and evaluate its latency overhead and deployment cost. Proof-of-guardrail ensures the integrity of guardrail execution while keeping the developer's agent private, but we also highlight a remaining risk of deception about safety, for example when a malicious developer actively jailbreaks the guardrail. Code and demo video: https://github.com/SaharaLabsAI/Verifiable-ClawGuard
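To make the offline verification step concrete, the following is a minimal sketch of what a user-side check could look like. It is not the authors' implementation: the document layout (`body`, `signature`, `measurement`, `commitment`), the helper names, and the use of SHA-256 are all assumptions made for illustration, and the HMAC-based `demo_sign`/`demo_check` pair is only a toy stand-in for the TEE platform's asymmetric signature, which a real verifier would check with the platform's public verification key.

```python
import hashlib
import hmac
import json
from typing import Callable

# Assumption: a toy shared key standing in for the TEE platform's signing key.
DEMO_KEY = b"toy-platform-key"


def measure(wrapper_code: bytes) -> str:
    """Toy measurement: SHA-256 over the open-source wrapper program."""
    return hashlib.sha256(wrapper_code).hexdigest()


def commit(user_input: str, response: str) -> str:
    """Toy commitment: hash of a canonical encoding of input and response."""
    payload = json.dumps({"input": user_input, "response": response}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def encode_body(body: dict) -> bytes:
    """Canonical byte encoding of the attestation body that gets signed."""
    return json.dumps(body, sort_keys=True).encode("utf-8")


def demo_sign(body: bytes) -> bytes:
    """Stand-in for the TEE signature (hypothetical, HMAC for demo only)."""
    return hmac.new(DEMO_KEY, body, hashlib.sha256).digest()


def demo_check(body: bytes, signature: bytes) -> bool:
    """Stand-in for verifying with the platform's verification key."""
    return hmac.compare_digest(demo_sign(body), signature)


def make_demo_attestation(wrapper_code: bytes, user_input: str, response: str) -> dict:
    """Build a toy attestation document as the enclave side might."""
    body = {
        "measurement": measure(wrapper_code),
        "commitment": commit(user_input, response),
    }
    return {"body": body, "signature": demo_sign(encode_body(body))}


def verify_attestation(
    doc: dict,
    wrapper_code: bytes,
    user_input: str,
    response: str,
    check_signature: Callable[[bytes, bytes], bool],
) -> bool:
    """Accept only if the signature, measurement, and commitment all match."""
    if not check_signature(encode_body(doc["body"]), doc["signature"]):
        return False
    if not hmac.compare_digest(doc["body"]["measurement"], measure(wrapper_code)):
        return False
    return hmac.compare_digest(doc["body"]["commitment"], commit(user_input, response))
```

In this sketch, a document built with `make_demo_attestation` verifies only against the same wrapper code, input, and response; changing any of the three makes `verify_attestation` return `False`. As the abstract cautions, a passing check shows only that the guardrail code ran, not that the response is safe.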

Paper Structure (18 sections, 3 figures, 3 tables)

This paper contains 18 sections, 3 figures, 3 tables.

Introduction
Background
Trusted Execution Environment and Remote Attestation
Remote Attestation Procedure
Proof-of-Guardrail
Problem Statement
Proof-of-Guardrail with TEE Attestation
Review of Desiderata in the Problem Statement
Experiments
Experiment Setup
Attack Simulations
Cost and Efficiency Analysis
Conclusion
Implementation Details Specific to OpenClaw and Nitro Enclaves
Nitro Enclave Image.
...and 3 more sections

Figures (3)

Figure 1: With proof-of-guardrail, users can cryptographically verify that the declared guardrails were executed for an agent response. Agent developers can build stronger trust with users by presenting the proof.
Figure 2: Proof-of-Guardrail system. (a) The guardrail $g$ (as part of the wrapper program $f$) is deployed in a TEE enclave and measured at initialization; the developer's agent is loaded afterwards. (b) When a proof is requested, the TEE produces a signed attestation document that includes a measurement $m$ (covering $f$) and a commitment to the input and the response. (c) Any user can verify the attestation using the open-source $f$, the input and response, and the verification key of the TEE platform, and thereby be convinced that the guardrail was executed when the response was generated.
Figure 3: An example conversation in which the user asks a high-stakes question to an AI bot on Telegram and is convinced that the response was generated after an open-source guardrail. The code repository includes a screenshot of the exact conversation on Telegram, where the agent is deployed as an AI bot (backed by OpenClaw) that automatically responds to user messages. Attestation documents are not truncated in practice.
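
The enclave-side flow in Figure 2 (a) and (b) can be sketched as follows. This is a toy model under stated assumptions rather than the paper's code: `ToyEnclave`, `load_agent`, `respond`, and the `is_allowed` guardrail callable are hypothetical names, the measurement is modeled as a SHA-256 hash of the wrapper source, the guardrail is assumed to screen the input before the agent runs, and the hardware signature over the attestation body is omitted. The point it illustrates is that the measurement is fixed before the private agent is loaded, so it covers the open-source wrapper and guardrail only.

```python
import hashlib
import json
from typing import Callable, Optional


class ToyEnclave:
    """Toy model of a wrapper program f that embeds a guardrail g."""

    REFUSAL = "Request blocked by guardrail."

    def __init__(self, wrapper_source: bytes, is_allowed: Callable[[str], bool]):
        # Measured once at initialization: covers the wrapper (with guardrail) only.
        self.measurement = hashlib.sha256(wrapper_source).hexdigest()
        self._is_allowed = is_allowed
        self._agent: Optional[Callable[[str], str]] = None

    def load_agent(self, agent: Callable[[str], str]) -> None:
        # The private agent is loaded later and does not alter the measurement.
        self._agent = agent

    def respond(self, user_input: str) -> dict:
        if self._agent is None:
            raise RuntimeError("agent has not been loaded")
        # Assumption: the guardrail screens the input before the agent runs.
        if self._is_allowed(user_input):
            response = self._agent(user_input)
        else:
            response = self.REFUSAL
        payload = json.dumps({"input": user_input, "response": response}, sort_keys=True)
        commitment = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        # A real TEE would sign this body with its platform key; omitted here.
        body = {"measurement": self.measurement, "commitment": commitment}
        return {"response": response, "attestation_body": body}
```

Because the measurement is computed before `load_agent` is called, two developers running different private agents behind the same wrapper would present the same measurement in this model, which is what lets the agent stay private while the guardrail remains publicly checkable.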
