AgentStepper: Interactive Debugging of Software Development Agents

Robert Hutter, Michael Pradel

TL;DR

AgentStepper is the first interactive debugger for LLM-based software development agents, which are hard to understand and debug because they operate through cycles of prompts, tool calls, and code changes. By representing trajectories as structured conversations, providing breakpoints and live editing, and visualizing repository-level changes, it raises the debugging abstraction to high-level agent actions. An empirical evaluation shows modest integration effort (39-42 edited lines) across three state-of-the-art agents, and a user study with twelve participants found that AgentStepper improved trajectory comprehension and bug identification while reducing perceived workload. The work offers a practical path toward more reliable, trustworthy software development agents and makes its data and code publicly available for further research.

Abstract

Software development agents powered by large language models (LLMs) have shown great promise in automating tasks like environment setup, issue solving, and program repair. Unfortunately, understanding and debugging such agents remain challenging due to their complex and dynamic nature. Developers must reason about trajectories of LLM queries, tool calls, and code modifications, but current techniques reveal little of this intermediate process in a comprehensible format. The key insight of this paper is that debugging software development agents shares many similarities with conventional debugging of software programs, yet requires a higher level of abstraction that raises the level from low-level implementation details to high-level agent actions. Drawing on this insight, we introduce AgentStepper, the first interactive debugger for LLM-based software engineering agents. AgentStepper enables developers to inspect, control, and interactively manipulate agent trajectories. AgentStepper represents trajectories as structured conversations among an LLM, the agent program, and tools. It supports breakpoints, stepwise execution, and live editing of prompts and tool invocations, while capturing and displaying intermediate repository-level code changes. Our evaluation applies AgentStepper to three state-of-the-art software development agents, ExecutionAgent, SWE-Agent, and RepairAgent, showing that integrating the approach into existing agents requires minor code changes (39-42 edited lines). Moreover, we report on a user study with twelve participants, indicating that AgentStepper improves the ability of participants to interpret trajectories (64% vs. 67% mean performance) and identify bugs in the agent's implementation (17% vs. 60% success rate), while reducing perceived workload (e.g., frustration reduced from 5.4/7.0 to 2.4/7.0) compared to conventional tools.
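
To make the idea of a trajectory as a structured conversation with breakpoints more concrete, the following is a minimal sketch, not the AgentStepper API. All names here (`Event`, `TrajectoryDebugger`, `exchange`, `on_break`, `fake_llm`) are hypothetical and chosen for illustration only. The sketch assumes that every message between the agent program, the LLM, and tools passes through one function, which records the message and, at a breakpoint, lets a developer edit it before it is delivered.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Event:
    """One message in the trajectory, e.g. agent -> llm or agent -> tool."""
    sender: str
    receiver: str
    content: str


@dataclass
class TrajectoryDebugger:
    """Hypothetical debugger: records events and pauses at breakpoints."""
    # Called at a breakpoint; returns edited content, or None to keep it.
    on_break: Callable[[Event], Optional[str]]
    # Break whenever a message is about to be sent to one of these parties.
    break_on: Set[str] = field(default_factory=lambda: {"llm", "tool"})
    events: List[Event] = field(default_factory=list)

    def exchange(self, sender: str, receiver: str, content: str) -> str:
        event = Event(sender, receiver, content)
        if receiver in self.break_on:
            edited = self.on_break(event)
            if edited is not None:
                event.content = edited  # live edit of a prompt or tool call
        self.events.append(event)
        return event.content


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption for this sketch)."""
    return f"run_tests  # in response to: {prompt}"


def restrict_scope(event: Event) -> Optional[str]:
    """Example breakpoint handler: edit prompts, leave tool calls as they are."""
    if event.receiver == "llm":
        return event.content + " Only touch tests/."
    return None


dbg = TrajectoryDebugger(on_break=restrict_scope)
prompt = dbg.exchange("agent", "llm", "Fix the failing test.")
reply = dbg.exchange("llm", "agent", fake_llm(prompt))
tool_call = dbg.exchange("agent", "tool", "run_tests")

for e in dbg.events:
    print(f"{e.sender} -> {e.receiver}: {e.content}")
```

In an interactive debugger the `on_break` hook would block on user input instead of applying a fixed rule. The recorded `events` list is the kind of structure a conversation view could render.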

Paper Structure (38 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Overview of the approach. The upper part of the figure (AgentStepper) is this paper's contribution.
  • Figure 2: User interface of AgentStepper. Part A is a panel to select agent runs. Part B shows the structured conversation view with breakpoints and stepping controls. Part C displays repository-level code changes.
  • Figure 3: Minimal agent program using the AgentStepper API. Code with gray background shows API calls.
  • Figure 4: RQ2 results for the trajectory comprehension task (left) and the two bug localization tasks (middle and right).
  • Figure 5: RQ3 results showing NASA-TLX scores for the three tasks (lower is better).