Table of Contents
Fetching ...

When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents

Tsimur Hadeliya, Mohammad Ali Jauhar, Nidhi Sakpal, Diogo Cruz

TL;DR

The paper investigates how long-context windows and tool-use in LLM agents affect capability and safety. Using the AgentHarm framework with varied context paddings and two task-positions across multiple models, it reveals sharp degradation in performance around 100K tokens despite claimed large context windows, and highly variable, sometimes counterintuitive, shifts in refusal behavior. It also shows that padding type and padding position significantly shape both capability and safety outcomes, challenging assumptions that increasing context length inherently improves robustness. The findings highlight safety risks and measurement gaps for long-context LLM agents and motivate more holistic evaluation frameworks and guardrails for real-world deployments.

Abstract

Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool calling capabilities. Prior works have focused mainly on evaluation of LLMs on long-context prompts, leaving agentic setup relatively unexplored, both from capability and safety perspectives. Our work addresses this gap. We find that LLM agents could be sensitive to length, type, and placement of the context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50\% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano increases from $\sim$5\% to $\sim$40\% while Grok 4 Fast decreases from $\sim$80\% to $\sim$10\% at 200K tokens. Our work shows potential safety issues with agents operating on longer context and opens additional questions on the current metrics and paradigm for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.

When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents

TL;DR

The paper investigates how long-context windows and tool-use in LLM agents affect capability and safety. Using the AgentHarm framework with varied context paddings and two task-positions across multiple models, it reveals sharp degradation in performance around 100K tokens despite claimed large context windows, and highly variable, sometimes counterintuitive, shifts in refusal behavior. It also shows that padding type and padding position significantly shape both capability and safety outcomes, challenging assumptions that increasing context length inherently improves robustness. The findings highlight safety risks and measurement gaps for long-context LLM agents and motivate more holistic evaluation frameworks and guardrails for real-world deployments.

Abstract

Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool calling capabilities. Prior works have focused mainly on evaluation of LLMs on long-context prompts, leaving agentic setup relatively unexplored, both from capability and safety perspectives. Our work addresses this gap. We find that LLM agents could be sensitive to length, type, and placement of the context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50\% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano increases from 5\% to 40\% while Grok 4 Fast decreases from 80\% to 10\% at 200K tokens. Our work shows potential safety issues with agents operating on longer context and opens additional questions on the current metrics and paradigm for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.

Paper Structure

This paper contains 29 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Overview of our AgentHarm evaluation with after context padding (padding inserted immediately after the task description). We also consider a before position, where padding is inserted between the system prompt and the task. Unless noted, we report on the simplest public test subset: tasks with both a hint and detailed instructions that all models solve without padding. This lets us focus on behavior under padding instead of raw capability. Scoring combines two binary judges (refusal and semantic) with a rubric-based Harm Score that checks criteria such as whether all target tools were invoked and in the correct order. For this paper we focus on Harm Score and refusal rate as our main metrics.
  • Figure 2: AgentHarm score for benign and harmful tasks (top) and refusal rate (bottom) as random padding length increases, with padding placed after the task. Error bars denote $1\sigma$.
  • Figure 3: Performance comparison across different padding types (Random, Multi-task, Non-relevant, Relevant) for harmful tasks, with padding placed after the task. Shows GPT-4.1-nano and Grok 4 Fast. Error bars denote $1\sigma$.
  • Figure 4: Impact of padding position on harmful task performance: before vs. after the task description, for GPT-4.1-nano and Grok 4 Fast with random padding. Error bars denote $1\sigma$.
  • Figure 5: Random padding results for GPT-4.1-nano for different subsets of AgentHarm. These subsets have the same tasks and only differ in whether the task uses hint and detailed prompt. (a) Harm score for harmful tasks (b) Refusal rate for harmful tasks (c) Score for benign tasks
  • ...and 6 more figures