A Survey of LLM-based Automated Program Repair: Taxonomies, Design Paradigms, and Applications
Boyang Yang, Zijian Cai, Fengling Liu, Bach Le, Lingming Zhang, Tegawendé F. Bissyandé, Yang Liu, Haoye Tian
TL;DR
This survey introduces a unified design space for LLM-based automated program repair, organizing systems into four paradigms (fine-tuning, prompting, procedural pipelines, and agentic frameworks) with Retrieval-Augmented Generation (RAG) and Analysis-Augmented Generation (AAG) as orthogonal layers. It maps 62 representative systems, consolidates benchmark usage and evaluation protocols, and analyzes deployment scenarios, revealing trade-offs in task scope, cost, and controllability. The authors provide an auditable, living artifact pipeline for literature retrieval and coding decisions, and offer design guidelines for advancing semantic correctness, multi-hunk repair, and repository-scale repair in practical CI contexts. By synthesizing methodology, augmentation strategies, and evaluation practices, the work outlines concrete directions for robust, scalable, and trustworthy LLM-based software repair. The framework is intended to guide researchers and practitioners toward standardized benchmarks, cost-aware pipelines, and integrated retrieval and analysis strategies for real-world software maintenance.
Abstract
Large language models (LLMs) are reshaping automated program repair. We present a unified taxonomy that groups 62 recent LLM-based repair systems into four paradigms defined by parameter adaptation and control authority over the repair loop, and overlays two cross-cutting layers for retrieval and analysis augmentation. Prior surveys have focused on classical software repair techniques, on LLMs in software engineering more broadly, or on subsets of LLM-based software repair, such as fine-tuning strategies or vulnerability repair. We complement these works by treating fine-tuning, prompting, procedural pipelines, and agentic frameworks as first-class paradigms and systematically mapping representative systems to each of them. We also consolidate evaluation practice on common benchmarks by recording benchmark scope, pass@k, and fault-localization assumptions to support a more meaningful comparison of reported success rates. We clarify trade-offs among paradigms in task alignment, deployment cost, controllability, and the ability to repair multi-hunk or cross-file bugs. We discuss challenges in current LLM-based software repair and outline research directions. Our artifacts, including the representative papers and the scripted survey pipeline, are publicly available at https://github.com/GLEAM-Lab/ProgramRepair.
