Table of Contents
Fetching ...

TraceGuard: Structured Multi-Dimensional Monitoring as a Collusion-Resistant Control Protocol

Khanh Linh Nguyen, Hoa Nghiem, Tu Tran

Abstract

AI control protocols use monitors to detect attacks by untrusted AI agents, but standard single-score monitors face two limitations: they miss subtle attacks where outputs look clean but reasoning is off, and they collapse to near-zero safety when the monitor is the same model as the agent (collusion). We present TraceGuard, a structured multi-dimensional monitoring protocol that evaluates agent actions across five dimensions -- goal alignment, constraint adherence, reasoning coherence, safety awareness, and action-trace consistency -- scored in parallel by independent LLM calls, augmented by seven heuristic detectors and an LLM-based intent analyzer. We evaluate on BashArena (637 bash tasks, 4 attack categories) within the ControlArena framework. Our results on 519 samples (279 honest, 240 attack) show that: (1) the hybrid approach achieves clear attack-honest separation (attack mean 0.616 vs. honest mean 0.206, Delta=0.410); (2) structured scoring constrains collusion -- the untrusted structured monitor achieves 95% safety vs. 0% for single-score untrusted monitoring; (3) goal alignment and constraint adherence are the most discriminative dimensions; and (4) a separation-of-duties variant splitting dimensions across trusted and untrusted models achieves 100% safety while preventing any single model from seeing the full evaluation. TraceGuard is implemented as a new monitor type for the open-source ControlArena framework.

TraceGuard: Structured Multi-Dimensional Monitoring as a Collusion-Resistant Control Protocol

Abstract

AI control protocols use monitors to detect attacks by untrusted AI agents, but standard single-score monitors face two limitations: they miss subtle attacks where outputs look clean but reasoning is off, and they collapse to near-zero safety when the monitor is the same model as the agent (collusion). We present TraceGuard, a structured multi-dimensional monitoring protocol that evaluates agent actions across five dimensions -- goal alignment, constraint adherence, reasoning coherence, safety awareness, and action-trace consistency -- scored in parallel by independent LLM calls, augmented by seven heuristic detectors and an LLM-based intent analyzer. We evaluate on BashArena (637 bash tasks, 4 attack categories) within the ControlArena framework. Our results on 519 samples (279 honest, 240 attack) show that: (1) the hybrid approach achieves clear attack-honest separation (attack mean 0.616 vs. honest mean 0.206, Delta=0.410); (2) structured scoring constrains collusion -- the untrusted structured monitor achieves 95% safety vs. 0% for single-score untrusted monitoring; (3) goal alignment and constraint adherence are the most discriminative dimensions; and (4) a separation-of-duties variant splitting dimensions across trusted and untrusted models achieves 100% safety while preventing any single model from seeing the full evaluation. TraceGuard is implemented as a new monitor type for the open-source ControlArena framework.

Paper Structure

This paper contains 33 sections, 2 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: The defer-to-trusted protocol. The untrusted agent proposes an action; TraceGuard computes a composite suspicion score $s$. If $s \geq \tau$ (threshold), execution is deferred to a trusted model; otherwise, the agent's action proceeds. Safety = fraction of attacks deferred; usefulness = fraction of honest actions allowed.
  • Figure 2: Architectural overview of the bash_heuristic three-stage pipeline. Stage 1 runs 7 pure-function detectors (no LLM calls). The intent analyzer (Stage 2) is only invoked if detectors fire, reducing API costs on benign actions. Stage 3 scores all 5 dimensions in parallel via independent LLM calls. Boosts from each stage are additive, capped at 1.0 per dimension.
  • Figure 3: Safety-usefulness Pareto frontier across 21 suspicion thresholds (0.0--1.0, step 0.05). At threshold 0.5, the monitor achieves 70.8% safety with 93.9% usefulness.
  • Figure 4: ROC curve for per-sample binary classification (attack vs. honest) based on max suspicion score per trajectory. The improved attack-honest separation (attack mean 0.616, honest mean 0.206) after detector pipeline integration yields a substantially better ROC curve compared to the initial dimension-only scoring.
  • Figure 5: Safety (attack detection rate) by monitor at key thresholds. Single-score baselines (0% across all thresholds, from earlier experiments) are included for reference.
  • ...and 3 more figures