Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?

Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, Lingming Zhang

TL;DR

Live-SWE-agent introduces the first live software agent capable of self-evolving its own scaffold on the fly during real-world software tasks, starting from a minimal bash-only setup and focusing on on-demand tool creation. It eschews offline training and remains agnostic to the underlying LLM, enabling runtime adaptation and generalization across models. Empirical results on SWE-bench Verified and SWE-Bench Pro show state-of-the-art performance with 77.4% and 45.8% solve rates, respectively, outperforming both open-source baselines and commercial agents. The work also provides extensive tool-usage analysis and ablation studies, demonstrating that on-the-fly tool synthesis substantially boosts effectiveness and generalizes across LLM backends, with implications for unified benchmarks and future live-evolving AI systems in software engineering.

Abstract

Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories that solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are themselves software that can be further refined or modified, researchers have recently proposed a number of self-improving software agents, including the Darwin-Gödel Machine (DGM). However, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on the fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold, with access only to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve a solve rate of 77.4% without test-time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
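To make the idea of on-the-fly tool creation concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes that a "tool" is simply a script the agent writes to disk mid-task and then invokes like any other command. The names `LiveToolbox`, `create_tool`, `run_tool`, and `count_matches` are hypothetical, and the LLM that would decide when to create a tool is left out entirely.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


class LiveToolbox:
    """Hypothetical holder for scripts an agent writes for itself during a task."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.tools: dict[str, Path] = {}

    def create_tool(self, name: str, source: str) -> Path:
        # Assumption: creating a tool only means writing a script to disk,
        # so no offline training or scaffold redeployment is involved.
        path = self.root / f"{name}.py"
        path.write_text(source)
        self.tools[name] = path
        return path

    def run_tool(self, name: str, *args: str) -> str:
        # The tool runs as an ordinary subprocess, much as a bash-only
        # scaffold would run any other command.
        result = subprocess.run(
            [sys.executable, str(self.tools[name]), *args],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout


# Source of a tiny illustrative tool: count the lines of a file that
# contain a given keyword.
COUNT_TOOL = """
import sys
keyword, path = sys.argv[1], sys.argv[2]
with open(path) as fh:
    print(sum(1 for line in fh if keyword in line))
"""


def demo() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        target = root / "example.txt"
        target.write_text("def a():\n    pass\ndef b():\n    pass\n")

        toolbox = LiveToolbox(root)
        toolbox.create_tool("count_matches", COUNT_TOOL)
        return toolbox.run_tool("count_matches", "def", str(target)).strip()


if __name__ == "__main__":
    print(demo())
```

In this sketch the toolbox starts empty and grows only when a task calls for it, which mirrors the bash-only starting point described in the abstract. How Live-SWE-agent actually prompts for, stores, and reuses tools is not specified in this section.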

Paper Structure

This paper contains 22 sections, 10 figures, 6 tables.

Figures (10)

  • Figure 1: SWE-bench Verified and SWE-Bench Pro results (single attempt w/o test-time scaling)
  • Figure 2: Overview of Live-SWE-agent
  • Figure 3: Edit tool
  • Figure 4: MARC file analyzer tool
  • Figure 6: Two-dimensional t-SNE visualization of tools generated by Claude 4.5 Sonnet on SWE-bench Verified and SWE-Bench Pro. We label and display the embeddings by tool type, repository name, and programming language used in the repository. In the repository-name panel, we label only three repositories in the legend due to space considerations; they were chosen because they form representative, distinct clusters.
  • ...and 5 more figures