The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management
Tobias Lindenbauer, Igor Slinko, Ludwig Felder, Egor Bogomolov, Yaroslav Zharov
TL;DR
The paper tackles the escalating cost of maintaining large context histories in LLM-driven software engineering agents. It systematically compares simple environment observation masking, more sophisticated LLM-based trajectory summarization, and a novel hybrid approach across five model configurations within SWE-agent on SWE-bench Verified, with initial evidence that the findings generalize to the OpenHands scaffold. Observation masking roughly halves per-trajectory cost relative to the raw agent and matches, and sometimes slightly exceeds, the solve rate of LLM summarization. LLM summarization does not consistently outperform masking and can lengthen trajectories, which increases cost. The hybrid strategy, which combines masking with selective summarization, reduces cost by a further 7% relative to masking alone and 11% relative to LLM summarization, with occasional performance gains. These results challenge the push toward pure LLM-based summarization and underscore the value of efficient context-management design for scalable, sustainable AI agents. Code and data are released to support reproducibility and broader evaluation.
Abstract
Large Language Model (LLM)-based agents solve complex tasks through iterative reasoning, exploration, and tool-use, a process that can result in long, expensive context histories. While state-of-the-art Software Engineering (SE) agents like OpenHands or Cursor use LLM-based summarization to tackle this issue, it is unclear whether the increased complexity offers tangible performance benefits compared to simply omitting older observations. We present a systematic comparison of these approaches within SWE-agent on SWE-bench Verified across five diverse model configurations. Moreover, we show initial evidence of our findings generalizing to the OpenHands agent scaffold. We find that a simple environment observation masking strategy halves cost relative to the raw agent while matching, and sometimes slightly exceeding, the solve rate of LLM summarization. Additionally, we introduce a novel hybrid approach that further reduces costs by 7% and 11% compared to just observation masking or LLM summarization, respectively. Our findings raise concerns regarding the trend towards pure LLM summarization and provide initial evidence of untapped cost reductions by pushing the efficiency-effectiveness frontier. We release code and data for reproducibility.
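To make the idea of "simply omitting older observations" concrete, the following is a minimal sketch of observation masking, not the authors' implementation. It assumes that each agent turn consists of reasoning, an action, and an environment observation, and that only the observations of turns outside a recent window are replaced by a short placeholder while reasoning and actions are kept. The names `Turn`, `mask_observations`, `window`, and the placeholder text are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder text; the wording used by any real agent may differ.
PLACEHOLDER = "[observation omitted]"


@dataclass(frozen=True)
class Turn:
    """One agent step: what the model thought, what it did, what came back."""
    reasoning: str
    action: str
    observation: str


def mask_observations(history: List[Turn], window: int) -> List[Turn]:
    """Return a copy of `history` in which every observation older than the
    last `window` turns is replaced by a placeholder.

    Reasoning and actions are left untouched (an assumption of this sketch).
    """
    if window < 0:
        raise ValueError("window must be non-negative")
    # Turns with index < cutoff fall outside the recent window.
    cutoff = max(len(history) - window, 0)
    return [
        Turn(t.reasoning, t.action, PLACEHOLDER if i < cutoff else t.observation)
        for i, t in enumerate(history)
    ]


if __name__ == "__main__":
    demo = [Turn(f"think {i}", f"cmd {i}", f"output {i}") for i in range(5)]
    for turn in mask_observations(demo, window=2):
        print(turn.action, "->", turn.observation)
```

Because the function returns a new list and leaves the input unchanged, the full trajectory can still be logged for analysis while only the masked view is sent to the model. The window size is the single tuning knob in this sketch.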
