Spectral Guardrails for Agents in the Wild: Detecting Tool Use Hallucinations via Attention Topology

Valentin Noël

TL;DR

This paper addresses a reliability gap in deploying agents that call external tools: detecting tool-use hallucinations with high recall. It introduces a training-free guardrail based on spectral diagnostics of attention topology. Attention is treated as a dynamic graph, and features of its Laplacian spectrum (including Spectral Entropy, Fiedler value, and Smoothness) are used to identify the noisier, higher-entropy attention states that accompany hallucinations. A striking finding is that single-layer features with a single threshold achieve very high recall: 98.2% for layer 26 (L26) Smoothness on Llama and 94.7% for layer 3 (L3) Entropy on Mistral. The paper reads this as support for a thermodynamic interpretation of hallucination as a phase transition in attention structure. Cross-model analysis reveals the "Loud Liar" phenomenon: Llama 3.1 8B's failures are spectrally catastrophic and easier to detect, while Mistral 7B offers the best discrimination (AUC 0.900). The approach also comes with practical deployment guidance: thresholds can be calibrated on small held-out sets, and the method complements supervised detectors, offering an efficient, data-light safety net for tool use in the wild.

Abstract

Deploying autonomous agents in the wild requires reliable safeguards against tool use failures. We propose a training-free guardrail based on spectral analysis of attention topology that complements supervised approaches. On Llama 3.1 8B, our method achieves 97.7% recall with multi-feature detection and 86.1% recall with 81.0% precision for balanced deployment, without requiring any labeled training data. Most remarkably, we discover that single-layer spectral features act as near-perfect hallucination detectors: Llama L26 Smoothness achieves 98.2% recall (213/217 hallucinations caught) with a single threshold, and Mistral L3 Entropy achieves 94.7% recall. This suggests hallucination is not merely a wrong token but a thermodynamic state change: the model's attention becomes noise when it errs. Through controlled cross-model evaluation on matched domains ($N=1000$, $T=0.3$, same General domain, hallucination rates 20-22%), we reveal the "Loud Liar" phenomenon: Llama 3.1 8B's failures are spectrally catastrophic and dramatically easier to detect, while Mistral 7B achieves the best discrimination (AUC 0.900). These findings establish spectral analysis as a principled, efficient framework for agent safety.
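To make the idea of "spectral features of an attention graph" concrete, here is a minimal Python sketch. It is not the paper's implementation. The symmetrization step, the removal of self-loops, the entropy normalization, and the use of normalized Dirichlet energy as a smoothness proxy are all assumptions chosen for illustration, since the paper's exact definitions (Definitions 3.1 to 3.3) are not reproduced on this page. The function names, the `threshold` value, and the direction of the comparison are hypothetical.

```python
import numpy as np


def attention_laplacian(attn):
    """Combinatorial Laplacian L = D - W of one attention matrix.

    Assumption: the directed attention matrix is symmetrized and
    self-loops are dropped before building the graph.
    """
    w = 0.5 * (attn + attn.T)
    np.fill_diagonal(w, 0.0)
    return np.diag(w.sum(axis=1)) - w


def spectral_features(attn, signal):
    """Return illustrative spectral features for one attention graph.

    attn:   (n, n) non-negative attention weights over n tokens.
    signal: (n,) or (n, d) per-token values (e.g. hidden states).
    """
    lap = attention_laplacian(attn)
    # eigvalsh returns real eigenvalues in ascending order.
    eigvals = np.clip(np.linalg.eigvalsh(lap), 0.0, None)

    # Fiedler value: second-smallest Laplacian eigenvalue.
    fiedler = float(eigvals[1])

    # Spectral entropy (assumed form): Shannon entropy of the
    # eigenvalues normalized to sum to one.
    total = eigvals.sum()
    p = eigvals / total if total > 0 else eigvals
    p = p[p > 0]
    entropy = float(-(p * np.log(p)).sum())

    # Smoothness proxy (assumed form): normalized Dirichlet energy
    # trace(X^T L X) / ||X||^2; lower means a smoother signal.
    x = np.asarray(signal, dtype=float).reshape(len(signal), -1)
    norm = float((x * x).sum())
    energy = float(np.trace(x.T @ lap @ x)) / norm if norm > 0 else 0.0

    return {"fiedler": fiedler, "entropy": entropy, "dirichlet_energy": energy}


def flag_hallucination(value, threshold, higher_is_anomalous=True):
    """Single-feature threshold rule; threshold and direction are placeholders."""
    return value > threshold if higher_is_anomalous else value < threshold


# Toy example: uniform attention over three tokens, constant signal.
toy_attn = np.ones((3, 3)) / 3.0
toy_features = spectral_features(toy_attn, np.ones(3))
toy_flag = flag_hallucination(toy_features["entropy"], threshold=1.0)
```

In practice the threshold would be calibrated on a small held-out set, as the summary above describes, and the feature would be read from one specific layer of the model rather than from a toy matrix.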

Paper Structure (66 sections, 4 theorems, 13 equations, 1 figure, 18 tables, 1 algorithm)

Introduction
Related Work
Tool Use and Agent Reliability.
Hallucination Detection.
Spectral Analysis for LLMs.
Methods
Threat Model
Attention as Dynamic Graphs
Spectral Diagnostics
Classification Protocol
Deployment Considerations
Experiments
Setup
Cross-Model Analysis (Primary).
Domain-Specific Analysis (Secondary).
...and 51 more sections

Key Result

Proposition D.3

The combinatorial Laplacian $\bm{L}$ satisfies:

Figures (1)

Figure 1: Method Overview. Spectral analysis of attention graphs enables training-free hallucination detection. Hallucinations manifest as spectral collapse, a thermodynamic signature of incoherent reasoning. A single Smoothness feature achieves up to 98.2% recall on Llama.

Theorems & Definitions (14)

Definition 3.1: Spectral Entropy
Definition 3.2: Fiedler Value
Definition 3.3: Smoothness
Definition 3.4: High Frequency Energy Ratio
Definition D.1: Combinatorial Laplacian
Definition D.2: Normalized Laplacians
Proposition D.3: Laplacian Properties
proof
Definition D.4: Fiedler Value and Vector
Theorem D.5: Fiedler, 1973
...and 4 more
