From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents
Tobias Lindenbauer, Georg Groh, Hinrich Schütze
TL;DR
CTIM-Rover investigates whether a repository-integrated episodic memory (CTIM) can improve software engineering agents by leveraging cross-task experiences. The approach adapts ExpeL-style experiential learning and a Mixture-of-Experts-inspired distillation to produce both a general and a repository-level CTIM, integrated with AutoCodeRover. In experiments on SWE-bench Verified, CTIM-Rover does not outperform the baseline in any configuration and instead degrades performance, most likely because of noisy CTIM items and distracting exemplar trajectories. The work highlights the fragility of memory-augmented agents in real-world SE tasks and points to embedding-based retrieval at each turn as a way to reduce noise and improve relevance.
Abstract
We introduce CTIM-Rover, an AI agent for Software Engineering (SE) built on top of AutoCodeRover (Zhang et al., 2024) that extends agentic reasoning frameworks with an episodic memory, specifically a general and a repository-level Cross-Task-Instance Memory (CTIM). While existing open-source SE agents mostly rely on ReAct (Yao et al., 2023b), Reflexion (Shinn et al., 2023), or CodeAct (Wang et al., 2024), all of these reasoning and planning frameworks inefficiently discard their long-term memory after a single task instance. As repository-level understanding is pivotal for identifying all locations that must be patched to fix a bug, we hypothesize that SE is particularly well positioned to benefit from CTIM. To test this, we build on the Experiential Learning (EL) approach ExpeL (Zhao et al., 2024) and propose a Mixture-of-Experts (MoE)-inspired approach to create both a general-purpose and a repository-level CTIM. We find that CTIM-Rover does not outperform AutoCodeRover in any configuration and thus conclude that neither ExpeL nor DoT-Bank (Lingam et al., 2024) scales to real-world SE problems. Our analysis indicates that noise introduced by distracting CTIM items or exemplar trajectories is the likely source of the performance degradation.
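To make the two-level memory idea concrete, the following is a minimal sketch of how a general and a repository-level CTIM might be stored and routed per task. It is an illustration under stated assumptions, not the CTIM-Rover implementation: the class `CrossTaskMemory`, the helper `build_prompt`, the `max_items` cap, and the prompt wording are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CrossTaskMemory:
    """Hypothetical two-level memory: general insights plus per-repository insights."""

    general: list[str] = field(default_factory=list)
    by_repo: dict[str, list[str]] = field(default_factory=dict)

    def add(self, insight: str, repo: str | None = None) -> None:
        # An insight without a repository is treated as general knowledge.
        if repo is None:
            self.general.append(insight)
        else:
            self.by_repo.setdefault(repo, []).append(insight)

    def select(self, repo: str, max_items: int = 5) -> list[str]:
        # Route to the repository-specific "expert" memory first, then
        # fill any remaining slots with general insights.
        repo_items = self.by_repo.get(repo, [])
        return (repo_items + self.general)[:max_items]


def build_prompt(issue: str, repo: str, memory: CrossTaskMemory) -> str:
    # Prepend the selected insights to the task description (assumed format).
    insights = memory.select(repo)
    header = "\n".join(f"- {item}" for item in insights)
    return f"Insights from earlier tasks:\n{header}\n\nIssue:\n{issue}"


if __name__ == "__main__":
    memory = CrossTaskMemory()
    memory.add("Reproduce the bug with a failing test before patching.")
    memory.add("Check the ORM query compiler for related code paths.", repo="django/django")
    print(build_prompt("QuerySet.count() returns a stale value.", "django/django", memory))
```

Because every selected item is added to the prompt regardless of its relevance to the current issue, a scheme like this can easily inject distracting items, which is the kind of noise the analysis points to.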
