Arbiter: Detecting Interference in LLM Agent System Prompts

Tony Mason

TL;DR

Arbiter, a framework combining formal evaluation rules with multi-model LLM scouring to detect interference patterns in system prompts, shows that prompt architecture strongly correlates with observed failure class but not with severity, and that multi-model evaluation discovers categorically different vulnerability classes than single-model analysis.

Abstract

System prompts for LLM-based coding agents are software artifacts that govern agent behavior, yet lack the testing infrastructure applied to conventional software. We present Arbiter, a framework combining formal evaluation rules with multi-model LLM scouring to detect interference patterns in system prompts. Applied to three major coding agent system prompts: Claude Code (Anthropic), Codex CLI (OpenAI), and Gemini CLI (Google), we identify 152 findings across the undirected scouring phase and 21 hand-labeled interference patterns in directed analysis of one vendor. We show that prompt architecture (monolithic, flat, modular) strongly correlates with observed failure class but not with severity, and that multi-model evaluation discovers categorically different vulnerability classes than single-model analysis. One scourer finding, structural data loss in Gemini CLI's memory system, was consistent with an issue filed and patched by Google; the patch addressed the symptom without addressing the schema-level root cause identified by the scourer. Total cost of cross-vendor analysis: $0.27 USD.

Paper Structure (54 sections, 5 figures, 13 tables)

Figures (5)

  • Figure 1: Severity distribution across vendors. Claude Code shows a nearly uniform distribution across the lower three levels with an alarming tail. Smaller prompts (Codex, Gemini) peak at "notable."
  • Figure 2: Multi-model complementarity heat map for Claude Code. Findings are clustered into ten meta-categories. Security & trust is the only category found by 9/10 models; Resource & economic findings come almost exclusively from Kimi K2.5; MiniMax M2.5 contributes the broadest coverage (8/10 categories). The sparsity pattern demonstrates that models are complementary, not redundant.
  • Figure 3: New findings per scourer pass for Claude Code. The MiniMax M2.5 surge at pass 7 (+20 findings after pass 6 produced only 5) demonstrates that specific models bring viewpoints capable of reopening exploration even after apparent convergence. Gray bars indicate passes that voted to stop; the stopping criterion requires three consecutive "no" votes.
  • Figure 4: Channel distribution across vendor prompts. Claude Code v2.1.71 is the only prompt with a substantial memory channel (25%). Codex and Gemini CLI are behavioral-dominant (>70%). The version evolution from v2.1.50 to v2.1.71 shows tool definitions migrating out of the prompt text into API parameters.
  • Figure 5: API cost by model. The three most expensive models (Kimi K2.5, DeepSeek R1, Qwen3-235B) account for 61% of total cost, driven by retries and reasoning token overhead. GPT-OSS 120B provided 8 findings for $0.003.
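
The stopping rule mentioned in the Figure 3 caption (three consecutive "no" votes) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the names `run_scouring` and `scripted_pass`, and the shape of a pass result, are hypothetical, and the sketch assumes a "no" vote means the pass sees no reason to keep exploring.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical pass result: (new_findings, wants_more). wants_more is False
# when the pass votes "no", which is assumed here to mean "stop exploring".
PassResult = Tuple[List[str], bool]
ScourerPass = Callable[[List[str]], PassResult]


def run_scouring(passes: Iterable[ScourerPass], stop_after: int = 3) -> Tuple[List[str], int]:
    """Run passes in order until `stop_after` consecutive "no" votes occur.

    Returns the accumulated findings and the number of passes executed.
    """
    findings: List[str] = []
    consecutive_no = 0
    executed = 0
    for scour in passes:
        new_findings, wants_more = scour(findings)
        executed += 1
        findings.extend(new_findings)
        # A "yes" vote resets the streak, so a later model can reopen
        # exploration after earlier passes appeared to converge.
        consecutive_no = 0 if wants_more else consecutive_no + 1
        if consecutive_no >= stop_after:
            break
    return findings, executed


def scripted_pass(new_findings: List[str], wants_more: bool) -> ScourerPass:
    """Build a stand-in pass that returns a fixed result (for illustration only)."""
    return lambda seen: (list(new_findings), wants_more)


# Two "no" votes, then a pass that reopens exploration, then three "no" votes.
schedule = [
    scripted_pass(["a"], True),
    scripted_pass([], False),
    scripted_pass([], False),
    scripted_pass(["b"], True),
    scripted_pass([], False),
    scripted_pass([], False),
    scripted_pass([], False),
    scripted_pass(["unreached"], True),
]
findings, executed = run_scouring(schedule)
```

In this toy schedule the fourth pass resets the streak, so the loop ends only after the seventh pass and the eighth never runs. Requiring consecutive votes, rather than a total count, is what lets a single model with a different viewpoint extend the search.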