Bayesian Data Analysis in Empirical Software Engineering Research
Carlo A. Furia, Robert Feldt, Richard Torkar
TL;DR
This paper addresses the prevalence and limitations of frequentist statistics in empirical software engineering and demonstrates how Bayesian data analysis can provide clearer, more robust, and more nuanced conclusions. By reanalyzing two empirical studies (one on the effectiveness of automatically generated tests, the other on the Rosetta Code language benchmarks), the authors show that Bayesian methods yield full posterior distributions, allow prior knowledge to be incorporated meaningfully, and support predictive simulations for practical decision-making. They present a high-level Bayesian framework, illustrate both linear and generalized models (including Poisson GLMs), and perform prior-sensitivity analyses to reveal how conclusions depend on prior assumptions. The work argues that Bayesian statistics improve interpretability, reduce misinterpretation of significance, and enhance generalizability across studies, ultimately offering a principled path toward more actionable software engineering insights. The practical contribution includes guidelines and tool recommendations that facilitate adopting Bayesian analyses in future empirical research, with an emphasis on incorporating domain knowledge through priors, visualizing posteriors, and focusing on measures of practical significance rather than binary hypotheses.
Abstract
Statistics comes in two main flavors: frequentist and Bayesian. For historical and technical reasons, frequentist statistics have traditionally dominated empirical data analysis, and certainly remain prevalent in empirical software engineering. This situation is unfortunate because frequentist statistics suffer from a number of shortcomings (such as lack of flexibility and results that are unintuitive and hard to interpret) that curtail their effectiveness when dealing with the heterogeneous data that is increasingly available for empirical analysis of software engineering practice. In this paper, we pinpoint these shortcomings and present Bayesian data analysis techniques that provide tangible benefits, as they can yield clearer results that are simultaneously robust and nuanced. After a short, high-level introduction to the basic tools of Bayesian statistics, we present the reanalysis of two empirical studies on the effectiveness of automatically generated tests and the performance of programming languages. By contrasting the original frequentist analyses with our new Bayesian analyses, we demonstrate the concrete advantages of the latter. To conclude, we advocate a more prominent role for Bayesian statistical techniques in empirical software engineering research and practice.
