Empirical and Sustainability Aspects of Software Engineering Research in the Era of Large Language Models: A Reflection

David Williams, Max Hort, Maria Kechagia, Aldeida Aleti, Justyna Petke, Federica Sarro

TL;DR

This paper addresses the rigour challenges of integrating LLMs into Software Engineering research by performing a systematic review of ICSE's technical track publications from 2023–2025 and a supporting author survey. It uses a mixed-methods approach to extract a taxonomy of practices around models, benchmarks, contamination, replicability, and sustainability. Key results show a shift toward closed GPT-family models, uneven benchmarking against non-LLM baselines, rising attention to data contamination but limited mitigation, and substantial but under-quantified sustainability costs with poor artefact availability. The authors propose concrete recommendations and aim to contribute to unified guidelines for responsible and sustainable LLM-based SE research.

Abstract

Software Engineering (SE) research involving the use of Large Language Models (LLMs) has introduced several new challenges related to rigour in benchmarking, contamination, replicability, and sustainability. In this paper, we invite the research community to reflect on how these challenges are addressed in SE. Our results provide a structured overview of current LLM-based SE research at ICSE, highlighting both encouraging practices and persistent shortcomings. We conclude with recommendations to strengthen benchmarking rigour, improve replicability, and address the financial and environmental costs of LLM-based SE.

Paper Structure

This paper contains 4 sections and 2 tables.