The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management

Tobias Lindenbauer, Igor Slinko, Ludwig Felder, Egor Bogomolov, Yaroslav Zharov

TL;DR

The paper tackles the escalating cost of maintaining large context histories in LLM-driven SWE agents. It systematically compares simple environment observation masking, LLM-based trajectory summarization, and a novel hybrid approach across diverse model configurations, and probes whether the findings generalize to the OpenHands agent scaffold. The results show that Observation Masking often halves per-trajectory cost relative to the Raw Agent and frequently matches or exceeds the solve rate of LLM-Summary. LLM-Summary does not consistently outperform masking and can elongate trajectories, which increases cost. A hybrid strategy that combines masking with selective summarization achieves additional cost reductions and occasional performance gains, challenging the push toward pure LLM-based summarization and underscoring the value of efficient context-management design for scalable, sustainable AI agents. Code and data are released to support reproducibility and broader evaluation.

Abstract

Large Language Model (LLM)-based agents solve complex tasks through iterative reasoning, exploration, and tool-use, a process that can result in long, expensive context histories. While state-of-the-art Software Engineering (SE) agents like OpenHands or Cursor use LLM-based summarization to tackle this issue, it is unclear whether the increased complexity offers tangible performance benefits compared to simply omitting older observations. We present a systematic comparison of these approaches within SWE-agent on SWE-bench Verified across five diverse model configurations. Moreover, we show initial evidence of our findings generalizing to the OpenHands agent scaffold. We find that a simple environment observation masking strategy halves cost relative to the raw agent while matching, and sometimes slightly exceeding, the solve rate of LLM summarization. Additionally, we introduce a novel hybrid approach that further reduces costs by 7% and 11% compared to just observation masking or LLM summarization, respectively. Our findings raise concerns regarding the trend towards pure LLM summarization and provide initial evidence of untapped cost reductions by pushing the efficiency-effectiveness frontier. We release code and data for reproducibility.

Paper Structure

This paper contains 32 sections, 7 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Environment observation tokens dominate the context window of an SE agent's trajectory.
  • Figure 2: The effectiveness versus efficiency tradeoff for context management strategies within SWE-agent [yang_swe-agent_2024] on SWE-bench Verified [openai_swe_bench_verified_2024]. The plot compares solve rate (y-axis, $\uparrow$) against the average cost per trajectory (x-axis, $\downarrow$) for different model configurations. We test each configuration with three strategies: Raw Agent (baseline, ●), LLM-Summary (■), and Observation Masking (▲). Across all models, the Observation Masking strategy consistently lies on the efficiency frontier, achieving solve rates competitive with, and sometimes superior to, the LLM-Summary strategy. With Qwen3-Coder 480B [qwen3technicalreport, hui2024qwen2], the best-performing model in our experiments, Observation Masking is not only 52% cheaper than the Raw Agent baseline but also improves the solve rate by 2.6%. Moreover, it even reduces the cost per instance compared to LLM-Summary by $0.03 ($15 across 500 instances).
  • Figure 3: Overview of the context management strategies evaluated in our work. Box heights indicate the number of tokens in that portion of a typical trajectory. (a) The baseline ReAct-style [yao_react_2023] trajectory, where the context grows with each action-observation pair. (b) LLM-based summarization condenses older turns into a running summary while preserving a few recent turns in full (e.g., OpenHands [wang2025openhands]). (c) Observation masking replaces observations older than a fixed window of M turns (here, M=1) with a placeholder (e.g., SWE-agent [yang_swe-agent_2024]).
  • Figure 4: Impact of context management strategies on trajectory length. Box plots show the distribution of trajectory lengths (in turns) across different strategies within SWE-agent [yang_swe-agent_2024]. LLM-Summary consistently leads to longer trajectories, suggesting that summaries mask failure signals that would otherwise prompt earlier termination. The star indicates the mean trajectory length.
  • Figure 5: (a) Probing the generality of our findings with OpenHands [wang2025openhands] on the SWE-bench Verified-50 [badertdinov2024scaling] subset. After appropriately tuning the rolling window size $M$ to the agent scaffold, Observation Masking again matches the performance of LLM-Summary on both cost and solve rate. (b) Our novel hybrid Observation Masking and LLM-Summary approach on SWE-bench Verified-50 [badertdinov2024scaling] on the SWE-agent scaffold using Qwen3-Coder 480B. Effectively combining the strengths of each approach results in a strategy that robustly realizes efficiency gains regardless of trajectory length while benefiting from bounded context on excessively long trajectories. Our hybrid approach yields a slight solve-rate gain of 2.6 percentage points compared to the Raw Agent while reducing costs by 7% and 11% compared to Observation Masking and LLM-Summary, respectively.
  • ...and 9 more figures
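
The observation masking strategy described in the Figure 3(c) caption can be illustrated with a small sketch. This is not the SWE-agent or OpenHands implementation: the `Turn` class, the `mask_observations` function, and the placeholder text are hypothetical names chosen for illustration. The sketch assumes only what the caption states, namely that observations older than a fixed window of M turns are replaced with a placeholder while reasoning and actions are kept in full.

```python
from dataclasses import dataclass, replace
from typing import List

# Hypothetical placeholder text; the paper's actual wording is not given here.
PLACEHOLDER = "[observation omitted]"


@dataclass(frozen=True)
class Turn:
    """One agent step: the model's reasoning, its action, and the
    environment's observation in response (illustrative structure)."""
    reasoning: str
    action: str
    observation: str


def mask_observations(
    turns: List[Turn], window: int = 1, placeholder: str = PLACEHOLDER
) -> List[Turn]:
    """Return a copy of the trajectory in which every observation older
    than the most recent `window` turns is replaced by `placeholder`.
    Reasoning and actions are left untouched."""
    if window < 0:
        raise ValueError("window must be non-negative")
    # Turns with index below `cutoff` fall outside the rolling window.
    cutoff = max(len(turns) - window, 0)
    return [
        replace(turn, observation=placeholder) if index < cutoff else turn
        for index, turn in enumerate(turns)
    ]


if __name__ == "__main__":
    trajectory = [
        Turn("Locate the failing test.", "grep -rn 'test_parse' tests/", "tests/test_io.py:12: ..."),
        Turn("Open the file.", "cat tests/test_io.py", "import io\n..."),
        Turn("Run the test.", "pytest tests/test_io.py", "1 failed"),
    ]
    for turn in mask_observations(trajectory, window=1):
        print(turn.action, "->", turn.observation)
```

With `window=1`, as in the M=1 example from the caption, only the latest observation is passed to the model verbatim. The input list is not modified, so the full trajectory stays available for logging or later analysis.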