From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents

Tobias Lindenbauer, Georg Groh, Hinrich Schütze

TL;DR

CTIM-Rover investigates whether an episodic memory shared across task instances, a Cross-Task-Instance Memory (CTIM), can improve software engineering agents. The approach adapts ExpeL-style experiential learning with a Mixture-of-Experts-inspired distillation step to produce both general and repository-level CTIM, integrated with AutoCodeRover. In experiments on SWE-bench Verified, CTIM-Rover does not outperform the AutoCodeRover baseline in any configuration; the analysis points to noise from distracting CTIM items or exemplar trajectories as the likely cause of the degradation. The work highlights the fragility of memory-augmented agents on real-world SE tasks and suggests embedding-based retrieval at each turn as a way to reduce noise and improve relevance.

Abstract

We introduce CTIM-Rover, an AI agent for Software Engineering (SE) built on top of AutoCodeRover (Zhang et al., 2024) that extends agentic reasoning frameworks with an episodic memory, more specifically, a general and repository-level Cross-Task-Instance Memory (CTIM). While existing open-source SE agents mostly rely on ReAct (Yao et al., 2023b), Reflexion (Shinn et al., 2023), or Code-Act (Wang et al., 2024), all of these reasoning and planning frameworks inefficiently discard their long-term memory after a single task instance. As repository-level understanding is pivotal for identifying all locations that require a patch to fix a bug, we hypothesize that SE is particularly well positioned to benefit from CTIM. For this, we build on the Experiential Learning (EL) approach ExpeL (Zhao et al., 2024), proposing a Mixture-of-Experts (MoE) inspired approach to create both a general-purpose and repository-level CTIM. We find that CTIM-Rover does not outperform AutoCodeRover in any configuration and thus conclude that neither ExpeL nor DoT-Bank (Lingam et al., 2024) scales to real-world SE problems. Our analysis indicates noise introduced by distracting CTIM items or exemplar trajectories as the likely source of the performance degradation.

Paper Structure

This paper contains 22 sections, 24 figures, 3 tables.

Figures (24)

  • Figure 1: CTIM-Rover overview. Figure inspired by ExpeL (Zhao et al., 2024). CTIM-Rover first gathers new experiences on the train set of SWE-bench Verified, which we introduce in the dataset section (details in the dataset appendix). Then, it combines these experiences with existing experiences of AutoCodeRover (Zhang et al., 2024) on SWE-bench Lite (Jimenez et al., 2023). Next, it distills high-level and repository-level knowledge from these experiences. During evaluation, it recalls a past experience and conditions on the distilled knowledge. Key departures from ExpeL or AutoCodeRover in blue: (A) We extend AutoCodeRover with Reflexion (Shinn et al., 2023), allowing the agent to retry an instance up to three times while learning from its mistakes through self-reflection. (B) Compared to ExpeL, we also source experiences from past successful trajectories outside our system. (C) We introduce a novel domain-specific knowledge distillation phase (Figure 2) that extracts repository-level insights (e.g., common bug patterns).
  • Figure 2: CTIM-Rover knowledge distillation. Key departure from ExpeL (Zhao et al., 2024) in blue. Top: (1) Distill generally applicable SE knowledge from pairs of successful trajectories from different task instances and (2) tuples of a successful task instance and its self-reflection retries. Bottom: (3) Use the generally applicable knowledge and past experience to distill repository-level knowledge from pairs of successful trajectories from different task instances within the same repository and (4) tuples of a successful task instance and its self-reflection retries for a given repository.
  • Figure 3: Excerpt of the repository-level CTIM item that biased our system toward investigating the incorrect clean function, demonstrating how seemingly innocuous knowledge can misguide the agent.
  • Figure 4: The distribution of repositories across our train and test sets.
  • Figure 5: The distribution of repositories across instances successfully solved by CTIM-Rover on our train split and by AutoCodeRover on SWE-bench Lite (Jimenez et al., 2023). In total, there are 236 solved instances, from which we create our CTIM.
  • ...and 19 more figures
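
The four distillation inputs named in the Figure 2 caption can be sketched as a grouping step over past trajectories. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Trajectory` record, its fields, and the function `build_distillation_inputs` are hypothetical names introduced here, and forming all possible pairs (rather than sampling them) is a simplification.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one attempt at a task instance."""
    instance_id: str
    repo: str
    attempt: int   # 0 = first try, 1+ = self-reflection retries
    success: bool


def build_distillation_inputs(trajectories):
    """Group trajectories into the four input kinds from the caption."""
    by_instance = defaultdict(list)
    for t in trajectories:
        by_instance[t.instance_id].append(t)

    successes = []      # one successful trajectory per solved instance
    retry_tuples = []   # (failed attempts..., successful attempt)
    for attempts in by_instance.values():
        attempts = sorted(attempts, key=lambda t: t.attempt)
        winner = next((t for t in attempts if t.success), None)
        if winner is None:
            continue  # unsolved instances contribute nothing here
        successes.append(winner)
        failed_before = tuple(t for t in attempts if t.attempt < winner.attempt)
        if failed_before:
            retry_tuples.append(failed_before + (winner,))

    # (1) pairs of successful trajectories from different task instances
    general_pairs = list(combinations(successes, 2))
    # (2) a successful instance together with its self-reflection retries
    general_retries = retry_tuples

    # (3) the same pairs, restricted to a single repository
    repo_pairs = defaultdict(list)
    for a, b in general_pairs:
        if a.repo == b.repo:
            repo_pairs[a.repo].append((a, b))
    # (4) the same retry tuples, grouped by repository
    repo_retries = defaultdict(list)
    for tup in retry_tuples:
        repo_retries[tup[-1].repo].append(tup)

    return general_pairs, general_retries, dict(repo_pairs), dict(repo_retries)
```

In this sketch, the first two return values would feed the general distillation step and the last two the repository-level step. How the distilled items are prompted for and stored is not covered here.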