LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead

Junda He; Jieke Shi; Terry Yue Zhuo; Christoph Treude; Jiamou Sun; Zhenchang Xing; Xiaoning Du; David Lo

TL;DR

The paper addresses the challenge of scalably evaluating LLM-generated software artifacts in SE. It formally defines LLM-as-a-Judge, surveys 42 primary studies, categorizes their applications across requirements engineering, coding, QA, and maintenance, and identifies research gaps. It then lays out a roadmap toward 2030 focused on benchmarks, internal and external intelligence, multi-modal evaluation, robustness, and human-in-the-loop collaboration. The work aims to foster adoption of LLM-as-a-Judge frameworks to improve the scalability and reliability of software artifact evaluation.

Abstract

The rapid integration of Large Language Models (LLMs) into software engineering (SE) has revolutionized tasks like code generation, producing a massive volume of software artifacts. This surge has exposed a critical bottleneck: the lack of scalable, reliable methods to evaluate these outputs. Human evaluation is costly and time-consuming, while traditional automated metrics like BLEU fail to capture nuanced quality aspects. In response, the LLM-as-a-Judge paradigm - using LLMs for automated evaluation - has emerged. This approach leverages the advanced reasoning of LLMs, offering a path toward human-like nuance at automated scale. However, LLM-as-a-Judge research in SE is still in its early stages. This forward-looking SE 2030 paper aims to steer the community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts. We provide a literature review of existing SE studies, analyze their limitations, identify key research gaps, and outline a detailed roadmap. We envision these frameworks as reliable, robust, and scalable human surrogates capable of consistent, multi-faceted artifact evaluation by 2030. Our work aims to foster research and adoption of LLM-as-a-Judge frameworks, ultimately improving the scalability of software artifact evaluation.
