Self-Abstraction from Grounded Experience for Plan-Guided Policy Refinement
Hiroaki Hayashi, Bo Pang, Wenting Zhao, Ye Liu, Akash Gokul, Srijan Bansal, Caiming Xiong, Semih Yavuz, Yingbo Zhou
TL;DR
Self-Abstraction from Grounded Experience (SAGE) addresses the limited adaptability of LLM-based software engineering agents that operate in static execution frameworks by enabling test-time learning from grounded rollouts. It defines a three-stage loop (Exploration, Plan Abstraction, and Plan-Augmented Execution) in which a concise plan $\psi$ is induced from the grounded trajectory and used to condition subsequent policy execution within an MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R, \gamma)$. On SWE-Bench Verified, SAGE yields consistent gains across backbones (e.g., GPT-5, Gemini, Claude) and agent frameworks, reaching up to approximately 74% Pass@1, including a 7.2% relative improvement over the Mini-SWE-Agent baseline with GPT-5 (high). The work also covers plan attribution analysis, the impact of bug localization, and connections to reinforcement learning formalisms such as Semi-Markov Decision Processes and Bayesian RL, positioning SAGE as a general, plug-in test-time adaptation framework for LLM-based software engineering agents.
Abstract
Large language model (LLM) based agents are increasingly used to tackle software engineering tasks that require multi-step reasoning and code modification, demonstrating promising yet limited performance. However, most existing LLM agents operate within static execution frameworks and lack a principled mechanism to learn and self-improve from their own experience and past rollouts. As a result, their performance remains bounded by the initial framework design and the underlying LLM's capabilities. We propose Self-Abstraction from Grounded Experience (SAGE), a framework that enables agents to learn from their own task executions and refine their behavior through self-abstraction. After an initial rollout, the agent induces a concise plan abstraction from its grounded experience, distilling key steps, dependencies, and constraints. This learned abstraction is then fed back as contextual guidance, refining the agent's policy and supporting more structured, informed subsequent executions. Empirically, SAGE delivers consistent performance gains across diverse LLM backbones and agent architectures. Notably, it yields a 7.2% relative performance improvement over the strong Mini-SWE-Agent baseline when paired with the GPT-5 (high) backbone. SAGE further achieves strong overall performance on the SWE-Bench Verified benchmark, reaching 73.2% and 74% Pass@1 resolve rates with the Mini-SWE-Agent and OpenHands CodeAct agent frameworks, respectively.
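The control flow described above (an initial rollout, plan abstraction from that rollout, then a plan-conditioned second execution) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Step`, `sage_episode`, `toy_policy`, and `toy_abstractor` are hypothetical, and the policy and abstractor are toy stand-ins for what would be LLM-backed components.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One (action, observation) pair from an agent rollout."""
    action: str
    observation: str


Trajectory = List[Step]

# Hypothetical interfaces (assumptions, not taken from the paper):
#   a policy maps (task, optional plan) to a trajectory,
#   an abstractor maps (task, trajectory) to a concise textual plan.
Policy = Callable[[str, Optional[str]], Trajectory]
Abstractor = Callable[[str, Trajectory], str]


def sage_episode(task: str, policy: Policy, abstractor: Abstractor) -> Tuple[str, Trajectory]:
    """Run one explore -> abstract -> plan-augmented execution cycle."""
    # Stage 1: initial rollout without any plan guidance.
    exploration = policy(task, None)
    # Stage 2: distill the grounded rollout into a concise plan.
    plan = abstractor(task, exploration)
    # Stage 3: re-run the same policy, now conditioned on the plan.
    final = policy(task, plan)
    return plan, final


# Toy stand-ins so the sketch runs end to end without an LLM.
def toy_policy(task: str, plan: Optional[str]) -> Trajectory:
    if plan is None:
        return [
            Step("grep error", "found in utils.py"),
            Step("edit utils.py", "tests fail"),
        ]
    return [Step(f"follow plan: {plan}", "tests pass")]


def toy_abstractor(task: str, trajectory: Trajectory) -> str:
    # Summarize the rollout as an ordered list of the actions taken.
    return " -> ".join(step.action for step in trajectory)


if __name__ == "__main__":
    plan, final = sage_episode("fix bug", toy_policy, toy_abstractor)
    print(plan)
    print(final[-1].observation)
```

In this sketch the plan is passed as an extra argument to the same policy, which mirrors the description of the abstraction being fed back as contextual guidance rather than changing model weights; how the plan is actually formatted and injected is an assumption here.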
