Towards Evaluation Guidelines for Empirical Studies involving LLMs

Stefan Wagner, Marvin Muñoz Barón, Davide Falessi, Sebastian Baltes

TL;DR

The paper addresses the challenge of conducting and reporting reproducible empirical studies involving large language models (LLMs) in software engineering (SE) by classifying study types and proposing preliminary guidelines. It argues that LLM-specific sources of variability, such as version drift, prompt wording, and non-determinism, call for tailored, transparent reporting to ensure validity and reproducibility. The main contributions are a taxonomy of study types and a set of actionable guidelines (declaration of usage, versioning, configuration, prompts, open baselines, and human validation) to improve rigor. The work is relevant to researchers, reviewers, and practitioners who need trustworthy, comparable evaluations of LLM-enabled SE tools and methods, and it invites ongoing community discussion to refine and extend the guidelines.

Abstract

In the short period since the release of ChatGPT, large language models (LLMs) have changed the software engineering research landscape. While there are numerous opportunities to use LLMs for supporting research or software engineering tasks, solid science needs rigorous empirical evaluations. However, so far, there are no specific guidelines for conducting and assessing studies involving LLMs in software engineering research. Our focus is on empirical studies that either use LLMs as part of the research process or evaluate existing or new tools that are based on LLMs. This paper contributes the first set of holistic guidelines for such studies. Our goal is to start a discussion in the software engineering research community to reach a common understanding of our standards for high-quality empirical studies involving LLMs.

Paper Structure

This paper contains 18 sections.