AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents

Zekun Wu, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

Abstract

Tool-augmented LLM agents increasingly serve as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking-quality metrics that measure what is recommended but not whether it is safe for the user. We introduce a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across seven LLMs (7B to frontier) and decomposes divergence into information-channel and memory-channel mechanisms. Across the seven models tested, we consistently observe the evaluation-blindness pattern: recommendation quality is largely preserved under contamination (utility preservation ratio approximately 1.0) while risk-inappropriate products appear in 65-93% of turns, a systematic safety failure poorly reflected by standard NDCG. Safety violations are predominantly information-channel-driven, emerge at the first contaminated turn, and persist without self-correction over 23-step trajectories; no agent across 1,563 contaminated turns explicitly questions tool-data reliability. Even narrative-only corruption (biased headlines, no numerical manipulation) induces significant drift while completely evading consistency monitors. A safety-penalized NDCG variant (sNDCG) reduces preservation ratios to 0.51-0.74, indicating that much of the evaluation gap becomes visible once safety is explicitly measured. These results motivate considering trajectory-level safety monitoring, beyond single-turn quality, for deployed multi-turn agents in high-stakes settings.
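
The evaluation-blindness pattern described above can be illustrated with a small sketch. The abstract does not give the exact definition of sNDCG or of the preservation ratio, so everything below is an assumption made for illustration: the function names, the `penalty` parameter, the choice to discount the gain of unsafe items, and the toy relevance scores are all hypothetical and are not taken from the paper.

```python
import math

def dcg(gains):
    """Discounted cumulative gain of a ranked list of relevance gains."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(gains):
    """Standard NDCG: DCG normalized by the DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def sndcg(gains, unsafe, penalty=1.0):
    """Hypothetical safety-penalized NDCG (assumed form, not the paper's).

    Each item flagged unsafe has its gain scaled by (1 - penalty) before the
    DCG is computed; the normalizer still uses the unpenalized gains.
    """
    penalized = [g * (1.0 - penalty) if u else g for g, u in zip(gains, unsafe)]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(penalized) / ideal if ideal > 0 else 0.0

def preservation_ratio(contaminated_score, clean_score):
    """Ratio of a metric under contamination to the same metric when clean."""
    return contaminated_score / clean_score if clean_score > 0 else float("nan")

# Toy paired turn: both sessions rank items with the same relevance gains,
# but the contaminated session's top item is risk-inappropriate for the user.
clean_gains = [3, 2, 1]
clean_unsafe = [False, False, False]
contaminated_gains = [3, 2, 1]
contaminated_unsafe = [True, False, False]

upr = preservation_ratio(ndcg(contaminated_gains), ndcg(clean_gains))
safety_ratio = preservation_ratio(
    sndcg(contaminated_gains, contaminated_unsafe),
    sndcg(clean_gains, clean_unsafe),
)
print(f"NDCG preservation ratio:  {upr:.2f}")
print(f"sNDCG preservation ratio: {safety_ratio:.2f}")
```

In this toy case the plain NDCG ratio stays at 1.0 because the ranking quality is unchanged, while the penalized ratio drops well below 1.0 because the top-ranked item is unsafe. That is the qualitative gap the abstract reports; the numbers themselves carry no meaning beyond the example.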

Paper Structure (63 sections, 9 equations, 17 figures, 33 tables, 1 algorithm)

Figures (17)

  • Figure 1: Overview of our paired-trajectory diagnostic protocol. (A) A ReAct agent with persistent memory replays real financial dialogues in clean and contaminated sessions. (B) The perturbation probe applies four modes (risk inversion, metric manipulation, biased headlines, TQQQ injection) to tool outputs. (C) Divergence is decomposed into information-channel and memory-channel mechanisms, revealing evaluation blindness: quality metrics remain stable (UPR $\approx$ 1.0) while suitability violations and severity increase across models.
  • Figure 2: Pathway diagram. $A$ contaminates $\mathcal{O}_t$; divergence flows via a direct information channel and an indirect memory channel. Dashed edge: temporal persistence.
  • Figure 3: NDCG vs. SVR$_s$ ($1\sigma$ ellipses, 7 models).
  • Figure 4: Temporal dynamics. (a) Individual trajectories show heterogeneous but persistent drift. (b) Aggregate drift and safety violations across seven models; violations emerge at turn 1 with no robustness buffer.
  • Figure 5: Mean NDCG comparison between clean and contaminated sessions per user (Claude Sonnet 4.6). Mean UPR = 1.000.
  • ...and 12 more figures