A Framework for Using LLMs for Repository Mining Studies in Empirical Software Engineering

Vincenzo de Martino; Joel Castaño; Fabio Palomba; Xavier Franch; Silverio Martínez-Fernández

TL;DR

The paper addresses the challenge of reliably applying LLMs to software repository mining in empirical software engineering. It introduces PRIMES, a four-stage framework that guides prompt creation, pilot testing, cross-model evaluation, and output validation to improve data collection and reproducibility. Reflecting on two prior studies, one on GitHub ML projects and another on Hugging Face models, it demonstrates concrete prompt-design, validation, and tracking practices that improve data quality. The authors also discuss limitations such as hallucinations, biases, and costs, and call for standardizing and extending the framework to make LLM-enabled repository mining more robust and scalable.

Abstract

Context: The emergence of Large Language Models (LLMs) has significantly transformed Software Engineering (SE) by providing new methods for analyzing software repositories. Objectives: Our objective is to establish a practical framework for SE researchers who need to improve their data collection and the resulting datasets when conducting software repository mining studies with LLMs. Method: This experience report shares insights from two previous repository mining studies, focusing on the methods used to create, refine, and validate prompts that improve LLM output, particularly for data collection in empirical studies. Results: We package these insights into a framework, named Prompt Refinement and Insights for Mining Empirical Software repositories (PRIMES), consisting of a checklist that can improve the performance of LLM usage, enhance output quality, and minimize errors through iterative refinement and comparison across different LLMs. We also emphasize the importance of reproducibility by implementing mechanisms for tracking model results. Conclusion: Our findings indicate that standardizing prompt engineering and using PRIMES can enhance the reliability and reproducibility of studies that use LLMs. Ultimately, this work calls for further research to address challenges such as hallucinations, model biases, and cost-effectiveness when integrating LLMs into research workflows.
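
To make the ideas of tracking model results and comparing different LLMs more concrete, the following Python sketch shows one possible way to record each LLM call and measure agreement across models. It is an illustration under stated assumptions, not the authors' implementation: the `RunRecord` class, the `majority_label` helper, the model names, the prompt text, and the repository identifier are all hypothetical.

```python
import hashlib
import json
from collections import Counter
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunRecord:
    """One LLM call, stored with the details needed to trace it later."""
    model: str           # hypothetical model identifier
    prompt_version: str  # hypothetical label for the prompt revision
    prompt: str
    item_id: str         # e.g. a repository or model-card identifier
    output: str

    def fingerprint(self) -> str:
        # Hash the canonical JSON form so identical runs share an id.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def majority_label(records):
    """Return the most common normalized output and its agreement ratio."""
    counts = Counter(r.output.strip().lower() for r in records)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(records)


# Hypothetical outputs from three models for the same item and prompt.
PROMPT = "Is this an ML project? Answer yes or no."
records = [
    RunRecord("model-a", "v2", PROMPT, "repo-1", "Yes"),
    RunRecord("model-b", "v2", PROMPT, "repo-1", "yes"),
    RunRecord("model-c", "v2", PROMPT, "repo-1", "No"),
]

label, agreement = majority_label(records)
print(label, round(agreement, 2))
```

Items with low agreement could then be routed to manual validation, and the fingerprints stored alongside the dataset so that each recorded output can be traced back to the model and prompt revision that produced it.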
