Is Reasoning Capability Enough for Safety in Long-Context Language Models?

Yu Fu; Haz Sameen Shahgir; Huanli Gong; Zhipeng Wei; N. Benjamin Erichson; Yue Dong

TL;DR

This work questions the assumption that enhanced reasoning in long-context LLMs improves safety. By introducing compositional reasoning attacks and evaluating 14 frontier models up to 64k tokens, the authors find that safety alignment often declines as reasoning becomes necessary and context length grows, even when retrieval remains effective. A central finding is that increasing inference-time reasoning can substantially bolster safety, indicating that the model’s harm-recognition capability exists but is not consistently activated by default. The results reveal a safety–efficiency trade-off and call for adaptive safety mechanisms that scale with both reasoning depth and context length in real-world deployments.

Abstract

Large language models (LLMs) increasingly combine long-context processing with advanced reasoning, enabling them to retrieve and synthesize information distributed across tens of thousands of tokens. One hypothesis is that stronger reasoning capability should improve safety by helping models recognize harmful intent even when it is not stated explicitly. We test this hypothesis in long-context settings where harmful intent is implicit and must be inferred through reasoning, and find that it does not hold. We introduce compositional reasoning attacks, a new threat model in which a harmful query is decomposed into incomplete fragments that are scattered throughout a long context. The model is then prompted with a neutral reasoning query that induces retrieval and synthesis, causing the harmful intent to emerge only after composition. Evaluating 14 frontier LLMs on contexts up to 64k tokens, we uncover three findings: (1) models with stronger general reasoning capability are not more robust to compositional reasoning attacks, often assembling the intent yet failing to refuse; (2) safety alignment consistently degrades as context length increases; and (3) inference-time reasoning effort is a key mitigating factor: increasing inference-time compute reduces attack success by over 50 percentage points on the GPT-oss-120b model. Together, these results suggest that safety does not automatically scale with reasoning capability, especially under long-context inference.
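The third finding is stated in percentage points, which is an absolute difference between two attack success rates rather than a relative change. The sketch below shows that arithmetic under stated assumptions: the function names and the per-prompt boolean judgements are hypothetical, and the example lists are illustrative placeholders, not results from the paper.

```python
def attack_success_rate(judgements):
    """Return the attack success rate in percent.

    `judgements` is a list of booleans, one per attack prompt, where True
    means the attack succeeded (the model complied instead of refusing).
    This encoding is an assumption made for illustration.
    """
    if not judgements:
        raise ValueError("need at least one judgement")
    return 100.0 * sum(judgements) / len(judgements)


def percentage_point_reduction(asr_before, asr_after):
    """Absolute drop in attack success rate, in percentage points."""
    return asr_before - asr_after


# Illustrative placeholder outcomes (NOT data from the paper).
low_effort = [True] * 8 + [False] * 2    # 8 of 10 attacks succeed
high_effort = [True] * 2 + [False] * 8   # 2 of 10 attacks succeed

asr_low = attack_success_rate(low_effort)
asr_high = attack_success_rate(high_effort)
reduction = percentage_point_reduction(asr_low, asr_high)
print(f"ASR low effort:  {asr_low:.1f}%")
print(f"ASR high effort: {asr_high:.1f}%")
print(f"Reduction: {reduction:.1f} percentage points")
```

A reduction of 60 percentage points in this toy example means the rate fell from 80% to 20%; it does not mean the rate fell by 60% of its starting value.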

Paper Structure (29 sections, 7 figures, 2 tables)

Introduction
Methodology
Threat Model
Reasoning Types
Benchmark Construction
Evaluation Protocol
Experiment
Experimental Settings
Main Results
Reasoning Analysis
Reasoning Effort Improves Safety
Task Complexity Analysis
Long-Context Analysis
Context Relevant Analysis
Needle Position Analysis
...and 14 more sections

Figures (7)

Figure 1: Safety behavior of 14 LLMs under compositional reasoning attacks. The x-axis shows safety loss when moving from direct retrieval to decomposed reasoning-based attacks, and the y-axis shows safety loss from short to long contexts. Models cluster into distinct regimes, including systems that are relatively robust to decomposition but degrade in long contexts (e.g., GPT and Claude families), and systems that degrade under both decomposition and long contexts (e.g., Gemini-3 and DeepSeek). No LLM achieves both low decomposition sensitivity and stable long-context safety, corresponding to the lower-left robustness regime. Circle color indicates the model's maximum longer query capacity, and circle size indicates math-reasoning performance (see Appendix \ref{app:hyperparameters}).
Figure 2: Radar chart comparing model safety performance across reasoning types at 64k context length. Axes correspond to Direct Retrieval, Single-hop Aggregation, Chain Reasoning, and Multi-hop Deductive Reasoning; values indicate safety ratio (%).
Figure 3: Overview of the three reasoning types in our benchmark. All types use the same neutral trigger query, but differ in how harmful fragments are distributed and how their relationships must be inferred. Single-hop Aggregation: two independent fragments are combined via simple aggregation. Chain Reasoning: three fragments form a sequential chain with explicit cross-references. Multi-hop Deductive Reasoning: four fragments require the model to infer an implicit bridge concept $X$ before synthesizing the final answer.
Figure 4: Impact of context length and inference-time reasoning effort on safety for GPT-oss-120b (results averaged over Types 1-3). Safety degrades sharply as context length increases (top), with high-effort reasoning partially mitigating long-context failures. Increasing reasoning effort yields substantial safety improvements (bottom), but requires orders-of-magnitude more reasoning tokens, highlighting a fundamental safety–efficiency trade-off.
Figure 5: Impact of inference-time reasoning effort on safety across reasoning types for GPT-oss-120b. Higher reasoning effort improves safety for all types, with Multi-hop Deductive Reasoning showing the largest relative gain (+277% from Low to High).
...and 2 more figures
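The captions above use two different kinds of quantity: Figure 1 plots safety losses, and Figure 5 reports a relative gain (+277%). The sketch below separates the two. It assumes that a safety loss is the difference between two safety ratios in percentage points and that relative gain is measured against the low-effort baseline; the captions do not spell out either formula, so both definitions and all function names are assumptions, and the numbers are illustrative placeholders rather than values from the paper.

```python
def safety_loss(ratio_before, ratio_after):
    """Drop in safety ratio, in percentage points (assumed definition)."""
    return ratio_before - ratio_after


def relative_gain(ratio_low, ratio_high):
    """Relative improvement in percent, measured against the low baseline."""
    if ratio_low <= 0:
        raise ValueError("baseline safety ratio must be positive")
    return 100.0 * (ratio_high - ratio_low) / ratio_low


# Illustrative placeholder safety ratios in percent (NOT data from the paper).
direct_retrieval = 90.0   # harmful query retrievable as a single piece
decomposed = 60.0         # same query split into scattered fragments
short_context = 70.0
long_context = 40.0

# One hypothetical model's position in a Figure 1 style scatter plot.
x = safety_loss(direct_retrieval, decomposed)   # decomposition sensitivity
y = safety_loss(short_context, long_context)    # long-context degradation
print(f"x = {x:.1f} pp, y = {y:.1f} pp")

# Relative gain from low to high reasoning effort, Figure 5 style.
print(f"relative gain: {relative_gain(20.0, 50.0):+.0f}%")
```

Under the assumed relative-gain definition, a figure of +277% means the high-effort safety ratio is about 3.77 times the low-effort one, so a large relative gain can still correspond to a modest absolute safety ratio when the baseline is small.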
