Bayesian Data Analysis in Empirical Software Engineering Research

Carlo A. Furia, Robert Feldt, Richard Torkar

TL;DR

This paper addresses the prevalence and limitations of frequentist statistics in empirical software engineering and demonstrates how Bayesian data analysis can provide clearer, more robust and nuanced conclusions. By reanalyzing two empirical studies—the effectiveness of automatically generated tests and the Rosetta Code language benchmarks—the authors show that Bayesian methods yield full posterior distributions, allow meaningful prior incorporation, and support predictive simulations for practical decision-making. They present a high-level Bayesian framework, illustrate both linear and generalized models (including Poisson GLMs), and perform prior-sensitivity analyses to reveal how conclusions depend on prior assumptions. The work argues that Bayesian statistics improve interpretability, reduce misinterpretation of significance, and enhance generalizability across studies, ultimately offering a principled path toward more actionable software engineering insights. The practical contribution includes guidelines and tool recommendations that facilitate adopting Bayesian analyses in future empirical research, with an emphasis on incorporating domain knowledge through priors, visualizing posteriors, and focusing on measures of practical significance rather than binary hypotheses.

Abstract

Statistics comes in two main flavors: frequentist and Bayesian. For historical and technical reasons, frequentist statistics have traditionally dominated empirical data analysis, and certainly remain prevalent in empirical software engineering. This situation is unfortunate because frequentist statistics suffer from a number of shortcomings (such as lack of flexibility and results that are unintuitive and hard to interpret) that curtail their effectiveness when dealing with the heterogeneous data that is increasingly available for empirical analysis of software engineering practice. In this paper, we pinpoint these shortcomings, and present Bayesian data analysis techniques that provide tangible benefits, as they can provide clearer results that are simultaneously robust and nuanced. After a short, high-level introduction to the basic tools of Bayesian statistics, we present the reanalysis of two empirical studies on the effectiveness of automatically generated tests and the performance of programming languages. By contrasting the original frequentist analyses with our new Bayesian analyses, we demonstrate the concrete advantages of the latter. To conclude, we advocate a more prominent role for Bayesian statistical techniques in empirical software engineering research and practice.

Paper Structure

This paper contains 38 sections, 16 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: The $p$-value is the probability, under the null hypothesis, of drawing data that is at least as extreme as the observed data $D^*$. Graphically, the $p$-value is the shaded area under the curve, which models the probability distribution under the null hypothesis.
  • Figure 2: Plots of the posterior probability distributions of each regression coefficient in the linear model with Gaussian error model \ref{eq:basic-lr}, computed with Bayesian analysis and weak unbiased priors. Each plot is a density plot that covers an area corresponding to 99% probability; the inner shaded area corresponds to 95% probability; and the thick vertical line marks the distribution's median.
  • Figure 3: Plots of the posterior probability distributions of each regression coefficient in the generalized linear model with Poisson error model \ref{eq:poisson-glm}, computed with Bayesian analysis. Each plot is a density plot that covers an area corresponding to 99% probability; the inner shaded area corresponds to 95% probability; and the thick vertical line marks the distribution's median.
  • Figure 4: Language relationship graphs summarizing the frequentist reanalysis of Nanz and Furia's Rosetta Code data, with different $p$-value correction methods. In every graph, an arrow from node $\ell_1$ to node $\ell_2$ indicates that the $p$-value comparing performance data between the two languages is $p < 0.01$ (solid line) or $0.01 \leq p < 0.05$ (dotted line), and Cliff's $\delta$ effect size indicates that $\ell_2$ tends to be faster; the thickness of the arrows is proportional to the absolute value of Cliff's $\delta$.
  • Figure 5: Bayesian reanalysis of Nanz and Furia's Rosetta Code data about the performance of 8 programming languages. For each comparison of language $\ell_1$ (column header) with language $\ell_2$ (row header), and for each choice of prior distribution among uniform $\mathcal{U}$, centered normal $\mathcal{N}$, and shifted normal $\mathcal{S}$, the table reports the endpoint of the 95% and 99% uncertainty intervals of the posterior inverse speedup of $\ell_1$ vs. $\ell_2$ that is closest to the origin, provided that the interval does not include the origin. In this case, the endpoint's absolute value is a lower bound on the inverse speedup of $\ell_1$ vs. $\ell_2$; a negative endpoint indicates that $\ell_1$ tends to be faster, and a positive endpoint that $\ell_2$ tends to be faster. If the uncertainty interval includes the origin, the table reports a value of 0.0, which indicates that the performance comparison is inconclusive. The table also reports median $m$ and mean $\mu$ of the posterior inverse speedup of $\ell_1$ vs. $\ell_2$ (again, negative values indicate that $\ell_1$ is faster on average, and positive values that $\ell_2$ is).
  • ...and 2 more figures
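
The reporting rule described in the Figure 5 caption (interval endpoint closest to the origin, or 0.0 when the interval contains the origin) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names `central_interval` and `closest_endpoint` are hypothetical, the interval is taken to be an equal-tailed interval computed from posterior samples with simple nearest-rank quantiles, and the samples are synthetic. The paper may compute its uncertainty intervals differently.

```python
import random

def central_interval(samples, prob):
    """Equal-tailed interval covering roughly `prob` of the samples.

    Assumption: simple nearest-rank quantiles; not necessarily the
    interval definition used in the paper.
    """
    xs = sorted(samples)
    n = len(xs)
    tail = (1.0 - prob) / 2.0
    lo = xs[int(tail * (n - 1))]
    hi = xs[int(round((1.0 - tail) * (n - 1)))]
    return lo, hi

def closest_endpoint(samples, prob):
    """Interval endpoint nearest the origin, or 0.0 if the interval contains 0."""
    lo, hi = central_interval(samples, prob)
    if lo <= 0.0 <= hi:
        return 0.0  # inconclusive comparison
    # Interval entirely positive: lower end is closest to 0.
    # Interval entirely negative: upper end is closest to 0.
    return lo if lo > 0.0 else hi

# Synthetic stand-ins for posterior samples (not data from the paper).
rng = random.Random(0)
clearly_positive = [rng.gauss(1.0, 0.2) for _ in range(4000)]
centered_on_zero = [rng.gauss(0.0, 1.0) for _ in range(4000)]

for prob in (0.95, 0.99):
    print(prob,
          closest_endpoint(clearly_positive, prob),
          closest_endpoint(centered_on_zero, prob))
```

With this convention, a nonzero value is a conservative bound on the size of the difference at the chosen probability level, and its sign indicates which of the two quantities being compared tends to be larger.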

Theorems & Definitions (1)

  • Example 2.1