Teaching Mining Software Repositories
Zadia Codabux, Fatemeh Fard, Roberto Verdecchia, Fabio Palomba, Dario Di Nucci, Gilberto Recupito
TL;DR
This chapter provides an educator-focused guide to Mining Software Repositories (MSR). It covers structured study design using the Goal-Question-Metric (GQM) framework, data extraction and cleaning, static and dynamic code analysis, and data analysis techniques for both unstructured and structured data. It emphasizes data quality, replicability, and clear communication through descriptive metrics and visuals, while addressing threats to validity and ethical considerations in MSR research. The chapter also discusses complementary mixed-method approaches, ethical guidelines, and current trends such as AI-enabled MSR and software ecosystem analysis, and it offers practical teaching strategies, exercises, and a starter list of repositories to support hands-on MSR education. Together, these give MSc and PhD students and educators methodological guidance, ethical awareness, and tools to conduct reproducible MSR studies that yield actionable insights for software engineering practice. The chapter closes by illustrating how MSR can be integrated with qualitative and quantitative methods to produce a richer, human-centered understanding of software development processes.
Abstract
Mining Software Repositories (MSR) has recently become a popular research area. MSR analyzes different sources of data, such as version control systems, code repositories, defect tracking systems, archived communication, and deployment logs, to uncover interesting and actionable insights for improved software development, maintenance, and evolution. This chapter provides an overview of MSR and how to conduct an MSR study, including setting up a study, formulating research goals and questions, identifying repositories, extracting and cleaning the data, performing data analysis and synthesis, and discussing MSR study limitations. Furthermore, the chapter discusses MSR as part of a mixed-methods study, explains how to mine data ethically, gives an overview of recent trends in MSR, and reflects on the future of the field. As a teaching aid, the chapter provides tips for educators, exercises for students at all levels, and a list of repositories that can be used as a starting point for an MSR study.
