Information Retrieval Induced Safety Degradation in AI Agents

Cheng Yu, Benedikt Stroebl, Diyi Yang, Orestis Papakyriakopoulos

TL;DR

This paper investigates safety risks in retrieval-enabled AI agents and identifies a robust safety degradation phenomenon: broader external retrieval (especially open web search) reduces refusal behavior and increases bias and harmful content, even for aligned models. Using a controlled benchmarking framework across censored and uncensored LLMs, wiki/web retrieval, and system prompts, the authors show that retrieval shifts model behavior in a structurally unsafe manner that is not fully mitigated by retrieval depth, accuracy, or prompt-based defenses. They demonstrate safety degradation across multiple benchmarks (bias- and harm-related) and model scales, and show that even a single retrieved document can trigger the effect, indicating a fundamental vulnerability in current retrieval-augmented systems. The work argues for retrieval-aware alignment strategies beyond prompt-level filtering and highlights implications for deploying autonomous agents in real-world environments, with future directions toward architectural safeguards and broader multilingual evaluations.

Abstract

Despite the growing integration of retrieval-enabled AI agents into society, their safety and ethical behavior remain inadequately understood. In particular, the integration of LLMs and AI agents with external information sources and real-world environments raises critical questions about how they engage with and are influenced by these external data sources and interactive contexts. This study investigates how expanding retrieval access -- from no external sources to Wikipedia-based retrieval and open web search -- affects model reliability, bias propagation, and harmful content generation. Through extensive benchmarking of censored and uncensored LLMs and AI agents, our findings reveal a consistent degradation in refusal rates, bias sensitivity, and harmfulness safeguards as models gain broader access to external sources, culminating in a phenomenon we term safety degradation. Notably, retrieval-enabled agents built on aligned LLMs often behave more unsafely than uncensored models without retrieval. This effect persists even under strong retrieval accuracy and prompt-based mitigation, suggesting that the mere presence of retrieved content reshapes model behavior in structurally unsafe ways. These findings underscore the need for robust mitigation strategies to ensure fairness and reliability in retrieval-enabled and increasingly autonomous AI systems.

Paper Structure

This paper contains 43 sections, 8 figures, 13 tables.

Figures (8)

  • Figure 1: Core Evaluation Framework. We compare three settings: (1) censored LLMs without retrieval, (2) censored LLM agents retrieving from Wikipedia or Web, (3) uncensored LLM variants.
  • Figure 2: Selected Refusal Rate Results for Illustration. For clarity, we show refusal rates on a representative subset of benchmarks and models under different agent configurations, with and without safe prompting. Full results are in Appendix \ref{app:figures_full}.
  • Figure 3: Selected Bias Scores for Illustration. Bias scores (mean $\pm$ 95% confidence interval) on the 2 bias benchmarks. Scores range from 1 to 5, with lower values reflecting stronger alignment with stereotypical content. Full results for all models and benchmarks are in Appendix \ref{app:figures_full}.
  • Figure 4: Selected Safety Scores for Illustration. Safety scores (mean $\pm$ 95% confidence interval) on the XSTest-v2 and SafeArena benchmarks. Scores range from 1 to 5. Higher scores indicate more helpful, appropriate, and safety-aligned responses. Comprehensive results in Appendix \ref{app:figures_full}.
  • Figure 5: Impact of Retrieval Configuration on Safety Metrics. Refusal rates (left) and bias/safety scores (right) across four benchmarks—BBQ, AIR-Bench, XSTest-v2, and SafeArena—for standard LLaMA3.2 and three WikiAgent configurations.
  • ...and 3 more figures
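
The figure captions above report refusal rates and scores as mean $\pm$ 95% confidence interval. As a rough illustration of how such summaries could be computed, here is a minimal sketch. It assumes a normal-approximation interval (1.96 standard errors), which this page does not confirm as the authors' method; the function names `refusal_rate` and `mean_with_ci95` and the sample values are hypothetical.

```python
import math
from statistics import mean, stdev


def refusal_rate(refused_flags):
    """Fraction of prompts the model refused (flags are booleans)."""
    if not refused_flags:
        raise ValueError("need at least one response")
    return sum(refused_flags) / len(refused_flags)


def mean_with_ci95(scores):
    """Return (mean, half_width) of a 95% interval for 1-5 judge scores.

    Assumption: normal approximation, half_width = 1.96 * s / sqrt(n).
    The paper's actual interval method may differ.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    half_width = 1.96 * stdev(scores) / math.sqrt(n)
    return mean(scores), half_width


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    flags = [True, False, True, True]
    scores = [1, 2, 3, 4, 5]
    m, hw = mean_with_ci95(scores)
    print(f"refusal rate: {refusal_rate(flags):.2f}")
    print(f"score: {m:.2f} +/- {hw:.2f}")
```

Comparing these summaries across the three settings in Figure 1 (no retrieval, Wikipedia or web retrieval, uncensored variants) is the kind of contrast the figures draw; overlapping intervals would suggest a difference is not clearly established.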