Structurally Aligned Subtask-Level Memory for Software Engineering Agents

Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue

TL;DR

We propose Structurally Aligned Subtask-Level Memory, a method that aligns memory storage, retrieval, and updating with the agent's functional decomposition. It consistently outperforms both vanilla agents and strong instance-level memory baselines across diverse backbones.

Abstract

Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents. Recent work has further explored augmenting these agents with memory mechanisms to support long-horizon reasoning. However, these approaches typically operate at a coarse instance granularity, treating the entire problem-solving episode as the atomic unit of storage and retrieval. We empirically demonstrate that instance-level memory suffers from a fundamental granularity mismatch, resulting in misguided retrieval when tasks with similar surface descriptions require distinct reasoning logic at specific stages. To address this, we propose Structurally Aligned Subtask-Level Memory, a method that aligns memory storage, retrieval, and updating with the agent's functional decomposition. Extensive experiments on SWE-bench Verified demonstrate that our method consistently outperforms both vanilla agents and strong instance-level memory baselines across diverse backbones, improving mean Pass@1 over the vanilla agent by +4.7 pp on average (e.g., +6.8 pp on Gemini 2.5 Pro). Performance gains grow with more interaction steps, showing that leveraging past experience benefits long-horizon reasoning in complex software engineering tasks.

Paper Structure (33 sections, 3 equations, 7 figures, 4 tables, 1 algorithm)


Figures (7)

  • Figure 1: Comparison of Instance-Level vs. Subtask-Level Memory. Top: Instance-level memory relies on global task similarity, causing reasoning interference when tasks share surface goals but require different reasoning logic. Bottom: Our method retrieves by matching stage-consistent subtask intents, enabling precise experience reuse even across globally dissimilar tasks.
  • Figure 2: Overview of the Structurally Aligned Subtask-Level Memory method. Unlike instance-level approaches, our method aligns memory operations with the agent's functional decomposition (e.g., Edit). The process operates in two key phases: (1) Retrieval ($R$): Initialized by the Subtask Intent, the agent queries the Memory State ($S_{\text{sub}}$). A Category Filter followed by a Similarity Match retrieves a structurally relevant historical anchor to provide Augmented Context. (2) Update ($U$): The Subtask Trajectory is processed by an Extractor to summarize abstract, transferable insights, which are stored as a New Memory Entry.
  • Figure 3: Temporal Dynamics of Experience Accumulation. Net gain ($\Delta$ Resolved) relative to the baseline across sequential bins of 100 instances. The trend illustrates the transition from a sparse memory state (1-200) to accelerated knowledge transfer (301-500) as experience accumulates.
  • Figure 4: Pass@1 Improvement by Complexity. Instances are grouped by baseline trajectory length (step count). The method delivers disproportionate gains (+8.7%) on Hard tasks ($>28$ steps), demonstrating its utility in mitigating long-horizon reasoning failures compared to the baseline.
  • Figure 5: Memory Retrieval Frequency. The distribution follows a long-tail pattern: a small "head" of generic memories is retrieved frequently, while a vast "tail" of over 400 single-use memories captures instance-specific edge cases.
  • ...and 2 more figures
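
The two phases described in the Figure 2 caption (a Category Filter followed by a Similarity Match for retrieval, and an Extractor that turns a subtask trajectory into a new memory entry for updating) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `MemoryEntry`, `SubtaskMemory`, `retrieve`, `update`, and `min_score` are hypothetical, the bag-of-words cosine similarity is a stand-in for whatever similarity measure the paper actually uses, and the extractor is passed in as a plain callable where the paper presumably uses an LLM-based summarizer.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    category: str  # functional category of the subtask, e.g. "Edit"
    intent: str    # short description of what the subtask tried to achieve
    insight: str   # abstract, transferable summary of the subtask trajectory


def _bag_of_words(text):
    # Assumption: a token-count vector stands in for a learned embedding.
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class SubtaskMemory:
    entries: list = field(default_factory=list)

    def retrieve(self, category, intent, min_score=0.1):
        """Retrieval (R): category filter, then similarity match on intent."""
        candidates = [e for e in self.entries if e.category == category]
        if not candidates:
            return None
        query = _bag_of_words(intent)
        scored = [(_cosine(query, _bag_of_words(e.intent)), e) for e in candidates]
        best_score, best_entry = max(scored, key=lambda pair: pair[0])
        # min_score is a hypothetical threshold to avoid returning unrelated anchors.
        return best_entry if best_score >= min_score else None

    def update(self, category, intent, trajectory, extractor):
        """Update (U): summarize the trajectory and store it as a new entry."""
        entry = MemoryEntry(category, intent, extractor(trajectory))
        self.entries.append(entry)
        return entry
```

In this sketch, a query tagged "Edit" can only ever match stored "Edit" entries, which is the stage-consistency property the captions for Figures 1 and 2 emphasize: two tasks with dissimilar overall descriptions can still share an anchor if their subtask intents within the same category are close.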