On the Creation of Representative Samples of Software Repositories
June Gorostidi, Adem Ait, Jordi Cabot, Javier Luis Cánovas Izquierdo
TL;DR
The paper addresses the challenge of constructing representative samples of large software repository populations, where conventional sampling often relies on random or popularity-based selections that threaten study validity. It introduces a four‑phase methodology (variable selection, variable analysis with preprocessing and stratification, composition, and sampling) implemented as a Python tool to build stratified random samples aligned with variables of interest. The approach supports both numerical and categorical variables and includes a replicability package, demonstrated through Hugging Face HFCommunity use cases. The work provides a practical, reproducible framework for improving representativeness in MSR studies, enabling better generalization and reducing selection biases in repository data analyses.
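The final sampling phase can be pictured with a minimal sketch. It assumes proportional allocation over strata defined by a single categorical variable. The function name `stratified_sample`, the record fields `id` and `type`, and the toy population are hypothetical and chosen only for illustration; this is not the authors' tool.

```python
import random
from collections import defaultdict


def stratified_sample(repos, stratum_of, sample_size, seed=0):
    """Draw a stratified random sample with proportional allocation.

    Hypothetical helper for illustration only. Each stratum contributes
    a share of the sample proportional to its share of the population,
    with at least one element per stratum. Because shares are rounded,
    the total may differ slightly from sample_size in general.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group the population into strata according to the chosen variable.
    strata = defaultdict(list)
    for repo in repos:
        strata[stratum_of(repo)].append(repo)

    total = len(repos)
    sample = []
    for key in sorted(strata):  # sorted for a deterministic stratum order
        members = strata[key]
        k = max(1, round(sample_size * len(members) / total))
        k = min(k, len(members))  # never ask for more than the stratum holds
        sample.extend(rng.sample(members, k))  # simple random draw within stratum
    return sample


# Toy population (assumed data): 60 models, 30 datasets, 10 spaces.
repos = (
    [{"id": f"m{i}", "type": "model"} for i in range(60)]
    + [{"id": f"d{i}", "type": "dataset"} for i in range(30)]
    + [{"id": f"s{i}", "type": "space"} for i in range(10)]
)

sample = stratified_sample(repos, lambda r: r["type"], sample_size=10)
```

With this toy population the sample keeps the 6:3:1 proportions of the three repository types, which a plain random draw of ten repositories would not guarantee. A numerical variable could be handled the same way by first binning it into ranges and using the bin as the stratum.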
Abstract
Software repositories are one of the sources of data in Empirical Software Engineering, primarily in the Mining Software Repositories field, which aims to extract knowledge from the dynamics and practice of software projects. With the emergence of social coding platforms such as GitHub, researchers now have access to millions of software repositories to use as source data for their studies. Given this massive amount of data, sampling techniques are needed to create more manageable datasets. The creation of these datasets is a crucial step, and researchers have to carefully select the repositories to create representative samples according to a set of variables of interest. However, current sampling methods are often based on random selection or rely on variables that may not be related to the research study (e.g., popularity or activity). In this paper, we present a methodology for creating representative samples of software repositories, where such representativeness is properly aligned with both the characteristics of the population of repositories and the requirements of the empirical study. We illustrate our approach with use cases based on Hugging Face repositories.
