Guidelines for Empirical Studies in Software Engineering involving Large Language Models

Sebastian Baltes; Florian Angermeir; Chetan Arora; Marvin Muñoz Barón; Chunyang Chen; Lukas Böhme; Fabio Calefato; Neil Ernst; Davide Falessi; Brian Fitzgerald; Davide Fucci; Marcos Kalinowski; Stefano Lambiase; Daniel Russo; Mircea Lungu; Lutz Prechelt; Paul Ralph; Rijnard van Tonder; Christoph Treude; Stefan Wagner

TL;DR

The paper addresses the reproducibility challenges that large language models pose for software engineering research. It introduces a taxonomy of LLM-based study types and eight guidelines, each with must and should criteria, to improve study design, reporting, and transparency. It describes, with examples, the roles of LLMs as annotators, judges, synthesizers, and subjects, as well as their use as tools for researchers and engineers, giving a framework for robust, auditable empirical work. Maintained as a living resource, it encourages the SE community to adopt open baselines, detailed prompts, interaction logs, and replication packages so that findings on AI-assisted SE practice are credible and transferable.

Abstract

Large language models (LLMs) are increasingly being integrated into software engineering (SE) research and practice, yet their non-determinism, opaque training data, and evolving architectures complicate the reproduction and replication of empirical studies. We present a community effort to scope this space, introducing a taxonomy of LLM-based study types together with eight guidelines for designing and reporting empirical studies involving LLMs. The guidelines present essential (must) criteria as well as desired (should) criteria and target transparency throughout the research process. Our recommendations, contextualized by our study types, are: (1) to declare LLM usage and role; (2) to report model versions, configurations, and fine-tuning; (3) to document tool architectures; (4) to disclose prompts and interaction logs; (5) to use human validation; (6) to employ an open LLM as a baseline; (7) to use suitable baselines, benchmarks, and metrics; and (8) to openly articulate limitations and mitigations. Our goal is to enable reproducibility and replicability despite LLM-specific barriers to open science. We maintain the study types and guidelines online as a living resource for the community to use and shape (llm-guidelines.org).
