Self-Abstraction from Grounded Experience for Plan-Guided Policy Refinement

Hiroaki Hayashi, Bo Pang, Wenting Zhao, Ye Liu, Akash Gokul, Srijan Bansal, Caiming Xiong, Semih Yavuz, Yingbo Zhou

TL;DR

Self-Abstraction from Grounded Experience (SAGE) tackles the limited adaptability of LLM-based software engineering agents that operate in static execution frameworks by enabling test-time learning from grounded rollouts. It defines a three-stage loop (Exploration, Plan Abstraction, and Plan-Augmented Execution) in which a concise plan $\psi$ is induced from the grounded trajectory and used to condition subsequent policy execution within an MDP $\mathcal{M}= (\mathcal{S},\mathcal{A},T,R,\gamma)$. On SWE-Bench Verified, SAGE yields consistent gains across backbones (e.g., GPT-5, Gemini, Claude) and agent frameworks, reaching up to 74% Pass@1 and a 7.2% relative improvement over the Mini-SWE-Agent baseline with GPT-5 (high). The work also covers plan attribution analysis, the impact of bug localization, and connections to reinforcement learning formalisms such as Semi-Markov Decision Processes and Bayesian RL, positioning SAGE as a general, plug-in test-time adaptation framework for LLM-based software engineering agents.

Abstract

Large language model (LLM) based agents are increasingly used to tackle software engineering tasks that require multi-step reasoning and code modification, demonstrating promising yet limited performance. However, most existing LLM agents operate within static execution frameworks and lack a principled mechanism to learn and self-improve from their own experience and past rollouts. As a result, their performance remains bounded by the initial framework design and the underlying LLM's capabilities. We propose Self-Abstraction from Grounded Experience (SAGE), a framework that enables agents to learn from their own task executions and refine their behavior through self-abstraction. After an initial rollout, the agent induces a concise plan abstraction from its grounded experience, distilling key steps, dependencies, and constraints. This learned abstraction is then fed back as contextual guidance, refining the agent's policy and supporting more structured, informed subsequent executions. Empirically, SAGE delivers consistent performance gains across diverse LLM backbones and agent architectures. Notably, it yields a 7.2% relative performance improvement over the strong Mini-SWE-Agent baseline when paired with the GPT-5 (high) backbone. SAGE further achieves strong overall performance on the SWE-Bench Verified benchmark, reaching 73.2% and 74% Pass@1 resolve rates with the Mini-SWE-Agent and OpenHands CodeAct agent frameworks, respectively.
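The three stages described above can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the authors' implementation: the names `Trajectory`, `run_sage`, `toy_agent`, and `toy_abstractor` are hypothetical, the agent and the plan abstractor are modeled as plain callables, and the way the plan is appended to the task prompt is one plausible form of "contextual guidance".

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    """Hypothetical record of one rollout: the steps taken and the final patch."""
    steps: List[str]
    patch: str


# Assumed interfaces: an agent maps a prompt to a trajectory; an abstractor
# maps (task, exploration trajectory) to a short textual plan.
Agent = Callable[[str], Trajectory]
Abstractor = Callable[[str, Trajectory], str]


def run_sage(task: str, agent: Agent, abstractor: Abstractor) -> Tuple[Trajectory, str, Trajectory]:
    # Stage 1, Exploration: the agent attempts the task with no extra guidance.
    exploration = agent(task)
    # Stage 2, Plan Abstraction: distill the grounded rollout into a concise plan.
    plan = abstractor(task, exploration)
    # Stage 3, Plan-Augmented Execution: retry with the plan added as context.
    # (The prompt layout here is an assumption made for illustration.)
    guided_prompt = f"{task}\n\nHigh-level plan from a previous attempt:\n{plan}"
    refined = agent(guided_prompt)
    return exploration, plan, refined


# Toy stand-ins so the sketch runs without any LLM backend.
def toy_agent(prompt: str) -> Trajectory:
    used_plan = "High-level plan" in prompt
    return Trajectory(
        steps=["inspect repository", "edit file", "run tests"],
        patch="guided-patch" if used_plan else "initial-patch",
    )


def toy_abstractor(task: str, trajectory: Trajectory) -> str:
    return "\n".join(f"{i + 1}. {step}" for i, step in enumerate(trajectory.steps))


if __name__ == "__main__":
    exploration, plan, refined = run_sage("Fix the failing test", toy_agent, toy_abstractor)
    print(plan)
    print(exploration.patch, "->", refined.patch)
```

Keeping the agent and the abstractor as separate callables reflects the point that each stage can use the same or a different backbone; a real system would replace the toy functions with calls into an agent framework.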

Paper Structure

This paper contains 29 sections, 6 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: SAGE consists of three stages. (1) Exploration: the agent attempts to finish the given task. (2) Plan Abstraction: an agent critically analyzes the exploration trajectory and suggests a high-level plan. (3) Plan-Augmented Execution: an agent attempts to finish the task again with access to the high-level plan. Each stage can be instantiated with the same or different LLM backbones and agents. We match the settings for Exploration and Plan-Augmented Execution throughout the experiments, while investigating different LLM backbones for Plan Abstraction.
  • Figure 2: Influence of candidate diversity on ensemble reasoning performance (Pass@1). Increasing candidate diversity improves the Best-of-N (Oracle) score (up to 83.4%) while lowering the Worst-of-N (Adversary) (to 52.2%), with mean performance remaining stable around 70%. The LLM-as-a-judge ensemble achieves 74.6%, outperforming the average individual model.
  • Figure 3: Issue description (psf__requests-2931) used in the rest of the case study.
  • Figure 4: Illustration of introspective plan induction and its downstream realization in code, based on a real SWE-Bench Verified example that SAGE improved.
  • Figure 5: Distribution of how often SAGE-induced plans surface in the code changes between the initial and SAGE-generated patches, indicating the correlation between the induced plan and the outcome changes. A SAGE-generated patch is considered to have a non-empty plan when at least one line in the code change can be attributed to at least one element in the plan.
  • ...and 1 more figures
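The Best-of-N (Oracle), Worst-of-N (Adversary), and mean scores named in the Figure 2 caption can be illustrated with a small computation. The sketch below assumes one common reading of these terms: the oracle counts an instance as resolved if any candidate resolves it, the adversary only if every candidate does, and the mean averages the per-candidate resolve rates. The function name `ensemble_bounds` and the toy outcome matrix are hypothetical and are not data from the paper.

```python
from typing import List, Tuple


def ensemble_bounds(results: List[List[bool]]) -> Tuple[float, float, float]:
    """results[c][i] is True if candidate c resolved instance i (assumed layout)."""
    n_candidates = len(results)
    n_instances = len(results[0])
    per_instance = list(zip(*results))  # transpose: one tuple of outcomes per instance

    # Oracle: always picks a resolving candidate when one exists.
    oracle = sum(any(outcomes) for outcomes in per_instance) / n_instances
    # Adversary: always picks a failing candidate when one exists.
    adversary = sum(all(outcomes) for outcomes in per_instance) / n_instances
    # Mean: average Pass@1 over the individual candidates.
    mean = sum(sum(row) / n_instances for row in results) / n_candidates
    return oracle, adversary, mean


if __name__ == "__main__":
    # Made-up outcomes for 3 candidates on 4 instances.
    toy = [
        [True, True, False, False],
        [True, False, True, False],
        [True, True, True, False],
    ]
    oracle, adversary, mean = ensemble_bounds(toy)
    print(f"oracle={oracle:.3f} adversary={adversary:.3f} mean={mean:.3f}")
```

Under this reading, more diverse candidates disagree on more instances, which widens the gap between oracle and adversary even when the mean stays flat, consistent with the trend the caption describes.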