TraceCoder: A Trace-Driven Multi-Agent Framework for Automated Debugging of LLM-Generated Code

Jiangping Huang, Wenguang Ye, Weisong Sun, Jian Zhang, Mingyue Zhang, Yang Liu

TL;DR

TraceCoder tackles the brittleness of LLM-generated code by introducing a trace-driven, multi-agent debugging framework that instruments runtime behavior, reasons causally, and repairs code iteratively. It integrates Instrumentation, Analysis, and Repair Agents, augmented by a Historical Lesson Learning Mechanism and a Rollback process to ensure stable convergence. Empirical results across multiple benchmarks and backbones show substantial improvements in Pass@1 over strong baselines, with notable gains from iterative repair and component contributions validated by ablations. The work advances automated debugging by delivering fine-grained execution visibility, history-informed planning, and cost-aware repair dynamics, and it provides an open-source implementation to foster reproducibility and further research.

Abstract

Large Language Models (LLMs) often generate code with subtle but critical bugs, especially for complex tasks. Existing automated repair methods typically rely on superficial pass/fail signals, offering limited visibility into program behavior and hindering precise error localization. In addition, without a way to learn from prior failures, repair processes often fall into repetitive and inefficient cycles. To overcome these challenges, we present TraceCoder, a collaborative multi-agent framework that emulates the observe-analyze-repair process of human experts. The framework first instruments the code with diagnostic probes to capture fine-grained runtime traces, enabling deep insight into its internal execution. It then conducts causal analysis on these traces to accurately identify the root cause of the failure. This process is further enhanced by a novel Historical Lesson Learning Mechanism (HLLM), which distills insights from prior failed repair attempts to inform subsequent correction strategies and prevent recurrence of similar mistakes. To ensure stable convergence, a Rollback Mechanism enforces that each repair iteration constitutes a strict improvement toward the correct solution. Comprehensive experiments across multiple benchmarks show that TraceCoder achieves up to a 34.43% relative improvement in Pass@1 accuracy over existing advanced baselines. Ablation studies verify the significance of each system component, with the iterative repair process alone contributing a 65.61% relative gain in accuracy. Furthermore, TraceCoder significantly outperforms leading iterative methods in terms of both accuracy and cost-efficiency.
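
To make the idea of fine-grained runtime traces concrete, the sketch below records the local variables seen at each executed line of a function. It is an illustration only: the abstract does not say how TraceCoder inserts its diagnostic probes, so Python's built-in sys.settrace hook is used here as a stand-in, and the names TraceResult, capture_trace, and buggy_sum are hypothetical.

```python
import sys
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class TraceResult:
    """Outcome of one traced call (hypothetical container, not from the paper)."""
    value: Any = None
    error: Optional[BaseException] = None
    # Each event is (line offset inside the function, snapshot of locals).
    events: list = field(default_factory=list)


def capture_trace(func: Callable[..., Any], *args: Any) -> TraceResult:
    """Run func(*args) and snapshot its locals before each executed line."""
    outcome = TraceResult()
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore frames that belong to other functions
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            outcome.events.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        outcome.value = func(*args)
    except Exception as exc:  # keep the failure as part of the trace
        outcome.error = exc
    finally:
        sys.settrace(previous)  # always restore the earlier hook
    return outcome


def buggy_sum(values):
    total = 0
    for v in values[1:]:  # bug: the slice skips the first element
        total += v
    return total


if __name__ == "__main__":
    trace = capture_trace(buggy_sum, [1, 2, 3])
    for offset, local_vars in trace.events:
        print(offset, local_vars)
```

A pass/fail signal would only report that buggy_sum([1, 2, 3]) returned the wrong value; the recorded events additionally show that v never takes the value 1, which points at the slice as the likely root cause.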

Paper Structure

This paper contains 45 sections, 3 equations, 3 figures, 7 tables, 2 algorithms.

Figures (3)

  • Figure 1: Limitations of simple execution feedback. Without runtime insights, the model repeatedly applies local patches that degrade the code's correctness, causing it to loop between incorrect versions rather than converging to a correct global solution.
  • Figure 2: Overview of TraceCoder’s workflow. ① An LLM generates an initial code solution. ② The code is executed and tested. A multi-agent debugging loop, comprising the Instrumentation, Analysis, and Repair Agents, emulates expert debugging behaviors by leveraging runtime tracing, the HLLM, and the Rollback Mechanism (RM) to enable effective and stable repair. After each failed attempt, the HLLM logs the outcome and informs the Analysis Agent’s strategy for the subsequent cycle.
  • Figure 3: Sensitivity analysis of max_attempts and patience on Pass@1 performance. Results are reported on BigCodeBench-Complete using Gemini-2.5-Flash-0417.
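
The captions above mention a rollback step, a log of failed attempts, and two loop parameters (max_attempts and patience) without defining them here. The sketch below shows one plausible way such a loop could fit together, under stated assumptions: max_attempts caps the number of repair attempts, patience caps consecutive non-improving attempts, and a candidate is kept only if it passes strictly more tests than the best version so far. The callables propose_fix and count_passed are hypothetical placeholders for the agents and the test runner; none of this is taken from the TraceCoder implementation.

```python
from typing import Callable, List, Tuple


def repair_loop(
    code: str,
    propose_fix: Callable[[str, List[str]], str],
    count_passed: Callable[[str], int],
    total_tests: int,
    max_attempts: int = 5,
    patience: int = 2,
) -> Tuple[str, int, List[str]]:
    """Iteratively repair code, keeping only strict improvements."""
    best_code = code
    best_score = count_passed(code)
    lessons: List[str] = []  # notes on failed attempts, fed to the next proposal
    stale = 0                # consecutive attempts without improvement

    for attempt in range(1, max_attempts + 1):
        if best_score == total_tests:
            break  # every test passes, nothing left to repair

        # Always repair from the best known version, never from a regression.
        candidate = propose_fix(best_code, lessons)
        score = count_passed(candidate)

        if score > best_score:
            best_code, best_score, stale = candidate, score, 0
        else:
            # Roll back: discard the candidate and record why it was rejected.
            lessons.append(
                f"attempt {attempt}: {score}/{total_tests} passed, rolled back"
            )
            stale += 1
            if stale >= patience:
                break

    return best_code, best_score, lessons
```

Because every new proposal starts from best_code, a patch that lowers the pass count cannot become the basis for later patches, which is the looping failure Figure 1 describes.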