Beyond Accuracy: Behavioral Dynamics of Agentic Multi-Hunk Repair

Noor Nashid, Daniel Ding, Keheliya Gallaba, Ahmed E. Hassan, Ali Mesbah

TL;DR

This paper studies multi-hunk program repair by evaluating four LLM-driven coding agents on 372 Hunk4J bugs and analyzing the resulting 1,488 repair trajectories with fine-grained metrics for localization, repair, regression, and operational dynamics. It introduces Maple, an MCP-based context tool that supplies repository-level information and improves the repair accuracy of Gemini-cli by 30%. The findings reveal substantial variation across agents (repair accuracy from 25.81% to 93.28%) and show that accuracy declines as hunk divergence and spatial dispersion increase. Higher-performing agents achieve positive regression reduction, whereas failed repairs consume more tokens and take longer to run than successful ones. The work argues for moving beyond accuracy alone to assess trustworthiness, resource usage, and context-aware planning, and positions Maple as a step toward more robust, context-driven automated repair.

Abstract

Automated program repair has traditionally focused on single-hunk defects, overlooking multi-hunk bugs that are prevalent in real-world systems. Repairing these bugs requires coordinated edits across multiple, disjoint code regions, posing substantially greater challenges. We present the first systematic study of LLM-driven coding agents (Claude Code, Codex, Gemini-cli, and Qwen Code) on this task. We evaluate these agents on 372 multi-hunk bugs from the Hunk4J dataset, analyzing 1,488 repair trajectories using fine-grained metrics that capture localization, repair accuracy, regression behavior, and operational dynamics. Results reveal substantial variation: repair accuracy ranges from 25.8% (Qwen Code) to 93.3% (Claude Code) and consistently declines with increasing bug dispersion and complexity. High-performing agents demonstrate superior semantic consistency, achieving positive regression reduction, whereas lower-performing agents often introduce new test failures. Notably, agents do not fail fast; failed repairs consume substantially more resources (39% to 343% more tokens) and require longer execution time (43% to 427% longer). Additionally, we develop Maple to provide agents with repository-level context. Empirical results show that Maple improves the repair accuracy of Gemini-cli by 30% through enhanced localization. By combining fine-grained metrics with trajectory-level analysis, this study moves beyond accuracy to explain how coding agents localize, reason, and act during multi-hunk repair.
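The headline numbers in the abstract follow from simple arithmetic, which the sketch below makes explicit. It assumes one trajectory per (bug, agent) pair and reads "X% more tokens" as a relative increase of the failed-run mean over the successful-run mean. Both readings are assumptions, not definitions taken from the paper, and the function names and token counts are hypothetical.

```python
# Minimal sketch of the arithmetic behind the abstract's summary figures.
# Assumptions (not from the paper): one trajectory per (bug, agent) pair, and
# "X% more" meaning (failed - success) / success * 100.

AGENTS = ["Claude Code", "Codex", "Gemini-cli", "Qwen Code"]
NUM_BUGS = 372


def trajectory_count(num_bugs: int, num_agents: int) -> int:
    """Return the number of repair trajectories, one per (bug, agent) pair."""
    return num_bugs * num_agents


def relative_overhead_pct(failed_mean: float, success_mean: float) -> float:
    """Return the percentage by which failed runs exceed successful runs."""
    if success_mean <= 0:
        raise ValueError("success_mean must be positive")
    return (failed_mean - success_mean) / success_mean * 100.0


total = trajectory_count(NUM_BUGS, len(AGENTS))
print(total)  # 1488

# Hypothetical mean token counts, chosen only to illustrate the formula.
overhead = relative_overhead_pct(failed_mean=1_390_000, success_mean=1_000_000)
print(f"{overhead:.0f}% more tokens")  # 39% more tokens
```

Under this reading, the upper end of the reported token range (343%) would correspond to failed repairs using roughly 4.4 times the tokens of successful ones.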

Paper Structure

This paper contains 35 sections, 13 equations, 10 figures, and 9 tables.

Figures (10)

  • Figure 1: Standardized prompt template provided to all coding agents (condensed).
  • Figure 2: Overlap of fixes across coding agents.
  • Figure 3: Upset-style plot showing bugs correctly repaired by the selected coding agents, grouped by spatial proximity class. Bars indicate intersections among coding agents; color segments denote proximity categories.
  • Figure 4: Faceted violin plots of average hunk divergence, grouped by coding agents and split by outcome.
  • Figure 5: Regression reduction across coding agents.
  • ...and 5 more figures