A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System

Jiale Guo, Suizhi Huang, Mei Li, Dong Huang, Xingsheng Chen, Regina Zhang, Zhijiang Guo, Han Yu, Siu-Ming Yiu, Pietro Lio, Kwok-Yan Lam

TL;DR

This survey addresses the gap between benchmarks and solutions in LLM-empowered software engineering by proposing a two-dimensional taxonomy that separates Solutions (Prompt-based, Fine-tune-based, Agent-based) from Benchmarks (code generation, translation, repair, and others) and presenting a unified workflow from task specification to deliverables. It analyzes 150+ papers to connect 50+ benchmarks with corresponding solution strategies and introduces a pipeline that captures how agent capabilities such as planning, reasoning, memory, and tool augmentation integrate across tasks. The study identifies key gaps, including multi-agent collaboration, self-evolving systems, and formal verification integration, and outlines concrete future directions to move toward production-ready, ethically deployed, and continuously learning SE agents. The associated GitHub repository provides ongoing updates, facilitating reproducibility and community-driven progress in LLM-driven software engineering.

Abstract

The integration of Large Language Models (LLMs) into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems capable of solving complex problems. However, systematic progress is hindered by a lack of comprehensive understanding of how benchmarks and solutions interconnect. This survey addresses this gap by providing the first holistic analysis of LLM-powered software engineering, offering insights into evaluation methodologies and solution paradigms. We review over 150 recent papers and propose a taxonomy along two key dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, including tasks such as code generation, translation, and repair. Our analysis highlights the evolution from simple prompt engineering to sophisticated agentic systems incorporating capabilities like planning, reasoning, memory mechanisms, and tool augmentation. To contextualize this progress, we present a unified pipeline illustrating the workflow from task specification to deliverables, detailing how different solution paradigms address various complexity levels. Unlike prior surveys that focus narrowly on specific aspects, this work connects 50+ benchmarks to their corresponding solution strategies, enabling researchers to identify optimal approaches for diverse evaluation criteria. We also identify critical research gaps and propose future directions, including multi-agent collaboration, self-evolving systems, and formal verification integration. This survey serves as a foundational guide for advancing LLM-driven software engineering. We maintain a GitHub repository that continuously updates the reviewed and related papers at https://github.com/lisaGuojl/LLM-Agent-SE-Survey.

Paper Structure

This paper contains 40 sections, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Illustration of the process of LLM-empowered software engineering.
  • Figure 3: Taxonomy of existing studies for software engineering.
  • Figure 4: Overview of reviewed studies.
  • Figure 5: The components of prompt-based solutions.
  • Figure 6: The components of fine-tune-based solutions.
  • ...and 1 more figure

Theorems & Definitions (3)

  • Definition 1: Code Generation
  • Definition 2: Code Translation
  • Definition 3: Program Repair