Applying Bayesian Analysis Guidelines to Empirical Software Engineering Data: The Case of Programming Languages and Code Quality

Carlo A. Furia, Richard Torkar, Robert Feldt

TL;DR

The high-level conclusions of this exercise will be that Bayesian statistical techniques can be applied to analyze software engineering data in a way that is principled, flexible, and leads to convincing results that inform the state of the art while highlighting the boundaries of their validity.

Abstract

Statistical analysis is the tool of choice to turn data into information, and then information into empirical knowledge. To be valid, the process that goes from data to knowledge should be supported by detailed, rigorous guidelines, which help ferret out issues with the data or model, and lead to qualified results that strike a reasonable balance between generality and practical relevance. Such guidelines are being developed by statisticians to support the latest techniques for Bayesian data analysis. In this article, we frame these guidelines in a way that is suited to empirical research in software engineering. To demonstrate the guidelines in practice, we apply them to reanalyze a GitHub dataset about code quality in different programming languages. The dataset's original analysis (Ray et al., 2014) and a critical reanalysis (Berger et al., 2019) have attracted considerable attention -- in no small part because they target a topic (the impact of different programming languages) on which strong opinions abound. The goals of our reanalysis are largely orthogonal to this previous work, as we are concerned with demonstrating, on data in an interesting domain, how to build a principled Bayesian data analysis and to showcase some of its benefits. In the process, we will also shed light on some critical aspects of the analyzed data and of the relationship between programming languages and code quality. The high-level conclusions of our exercise will be that Bayesian statistical techniques can be applied to analyze software engineering data in a way that is principled, flexible, and leads to convincing results that inform the state of the art while highlighting the boundaries of their validity. The guidelines can support building solid statistical analyses and connecting their results, and hence help buttress continued progress in empirical software engineering research.

Paper Structure

This paper contains 59 sections, 2 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Process for Bayesian data analysis: starting from an initial model, assess whether it is plausible, workable, and adequate. If it lacks any of these characteristics, refine the model by adding detail and features. Models that pass all checks can be fitted and used to answer the analysis's specific questions. Different models that pass all checks can be rigorously compared to select those that perform "best" according to suitable criteria. The outer loop (dashed arrows) indicates that an analysis's results may also suggest extending an adequate model so that it can answer more precise, or just different, questions; this outer loop is another source of multiple models that can be compared.
  • Figure 2: Violin plots of the distributions of number of bugs per project for each programming language in the original FSE dataset. Languages are sorted, left-to-right, by decreasing values of the distributions' medians. The vertical axis's scale is logarithmic in base $10$. A horizontal line marks the median number of bugs per project across all languages.
  • Figure 3: The likelihoods of statistical models $\mathcal{M}_1$, $\mathcal{M}_2$, and $\mathcal{M}_3$. Colors highlight the terms that are added to each model compared to the previous ones.
  • Figure 4: The priors of statistical models $\mathcal{M}_1$, $\mathcal{M}_2$, and $\mathcal{M}_3$. Colors highlight the terms that are added to each model compared to the previous ones.
  • Figure 5: Prior predictive simulation plots for models $\mathcal{M}_2$ and $\mathcal{M}_3$: each thin light blue line pictures one simulated distribution of the number of bugs in a project drawn from the priors. For comparison, the thick dark blue line pictures the distribution of the number of bugs in the measured data. The horizontal scale is logarithmic in base $10$.
  • ...and 7 more figures
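
The prior predictive simulation described in the caption of Figure 5 can be illustrated with a minimal sketch. It does not reproduce models $\mathcal{M}_2$ or $\mathcal{M}_3$, whose likelihoods and priors are not given in this section. As labeled assumptions, it uses a Poisson likelihood with a log link, a single Normal prior on the log-rate, and hypothetical names and values (`prior_predictive`, `poisson_draw`, `log10_summary`, `prior_mean`, `prior_sd`). Each simulated dataset corresponds to one thin line in such a plot: a distribution of bug counts implied by the priors alone, before seeing any data.

```python
import math
import random


def poisson_draw(rate, rng):
    """Draw one Poisson(rate) count by counting exponential arrivals in [0, 1]."""
    count = 0
    elapsed = rng.expovariate(rate)
    while elapsed <= 1.0:
        count += 1
        elapsed += rng.expovariate(rate)
    return count


def prior_predictive(n_sims, n_projects, prior_mean, prior_sd, seed=0):
    """Simulate bug counts from the prior only (no observed data involved).

    Assumed toy model, NOT the paper's:
        alpha ~ Normal(prior_mean, prior_sd)   # log of the expected bug count
        bugs  ~ Poisson(exp(alpha))            # one count per project
    """
    rng = random.Random(seed)
    simulations = []
    for _ in range(n_sims):
        alpha = rng.gauss(prior_mean, prior_sd)  # one draw from the prior
        rate = math.exp(alpha)
        simulations.append([poisson_draw(rate, rng) for _ in range(n_projects)])
    return simulations


def log10_summary(counts):
    """Median of log10(count + 1), matching a base-10 logarithmic scale."""
    values = sorted(math.log10(c + 1) for c in counts)
    mid = len(values) // 2
    if len(values) % 2:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2


if __name__ == "__main__":
    sims = prior_predictive(n_sims=10, n_projects=50, prior_mean=3.0, prior_sd=1.0)
    for i, counts in enumerate(sims):
        print(f"simulation {i}: median log10(bugs + 1) = {log10_summary(counts):.2f}")
```

Comparing the spread of these simulated summaries against the same summary computed on measured data gives a rough sense of whether the priors allow plausible outcomes without being overly restrictive; a prior that only produces implausible counts would be a signal to revise it.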