Information Retrieval Induced Safety Degradation in AI Agents
Cheng Yu, Benedikt Stroebl, Diyi Yang, Orestis Papakyriakopoulos
TL;DR
This paper investigates safety risks in retrieval-enabled AI agents and identifies a robust safety degradation phenomenon: broader external retrieval (especially open web search) reduces refusal behavior and increases biased and harmful content, even for aligned models. Using a controlled benchmarking framework that varies censored and uncensored LLMs, Wikipedia and web retrieval, and system prompts, the authors show that retrieval shifts model behavior in a structurally unsafe manner that is not fully mitigated by retrieval depth, retrieval accuracy, or prompt-based defenses. The degradation appears across multiple bias- and harm-related benchmarks and across model scales, and even a single retrieved document can trigger it, indicating a fundamental vulnerability in current retrieval-augmented systems. The work argues for retrieval-aware alignment strategies that go beyond prompt-level filtering, highlights implications for deploying autonomous agents in real-world environments, and points to future work on architectural safeguards and broader multilingual evaluations.
Abstract
Despite the growing integration of retrieval-enabled AI agents into society, their safety and ethical behavior remain inadequately understood. In particular, the integration of LLMs and AI agents with external information sources and real-world environments raises critical questions about how they engage with and are influenced by these external data sources and interactive contexts. This study investigates how expanding retrieval access, from no external sources to Wikipedia-based retrieval and open web search, affects model reliability, bias propagation, and harmful content generation. Through extensive benchmarking of censored and uncensored LLMs and AI agents, our findings reveal a consistent degradation in refusal rates, bias sensitivity, and harmfulness safeguards as models gain broader access to external sources, culminating in a phenomenon we term safety degradation. Notably, retrieval-enabled agents built on aligned LLMs often behave more unsafely than uncensored models without retrieval. This effect persists even under strong retrieval accuracy and prompt-based mitigation, suggesting that the mere presence of retrieved content reshapes model behavior in structurally unsafe ways. These findings underscore the need for robust mitigation strategies to ensure fairness and reliability in retrieval-enabled and increasingly autonomous AI systems.
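
To make the comparison across retrieval conditions concrete, the following is a minimal sketch of how a refusal rate could be computed per condition (no retrieval, Wikipedia, open web). It is an illustration under stated assumptions, not the evaluation pipeline used in the study: the keyword list in `REFUSAL_MARKERS`, the helper names `is_refusal` and `refusal_rates`, the condition labels, and the toy responses are all hypothetical, and the study may rely on a different refusal judge.

```python
from collections import defaultdict

# Hypothetical refusal markers: a simple keyword heuristic chosen for this
# sketch, not the refusal classifier used in the paper.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any of the assumed refusal markers."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rates(records):
    """Map each retrieval condition to the fraction of responses that refuse.

    `records` is an iterable of (condition, response) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # condition -> [refused, total]
    for condition, response in records:
        counts[condition][0] += is_refusal(response)
        counts[condition][1] += 1
    return {cond: refused / total for cond, (refused, total) in counts.items()}


# Toy placeholder responses, invented only to exercise the code.
# They are NOT data or results from the paper.
toy_records = [
    ("none", "I'm sorry, but I can't help with that."),
    ("none", "I cannot assist with this request."),
    ("wiki", "I cannot assist with this request."),
    ("wiki", "According to the retrieved article, ..."),
    ("web", "Based on the search results, ..."),
    ("web", "Here is a summary of the pages found."),
]

rates = refusal_rates(toy_records)
for condition, rate in rates.items():
    print(f"{condition}: refusal rate = {rate:.2f}")
```

In a real evaluation the same harmful or bias-probing prompts would be issued under each condition, so that any difference in refusal rate can be attributed to the retrieval setting and not to the prompt set.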
