A Method to Assess and Argue for Practical Significance in Software Engineering

Richard Torkar, Carlo A. Furia, Robert Feldt, Francisco Gomes de Oliveira Neto, Lucas Gren, Per Lenberg, Neil A. Ernst

TL;DR

This paper addresses the gap between statistical significance and practical impact in empirical software engineering by proposing a framework that combines Bayesian data analysis with Cumulative Prospect Theory (CPT). It applies this approach to reanalyze Afzal et al.'s case study on testing practices, deriving posterior distributions and CPT-based utilities that translate statistical findings into decision-oriented measures under realistic costs. The key contributions include a Bayesian multilevel analysis with zero-inflation to reflect data characteristics, integration with CPT to yield subjective utilities, and empirical validation with practitioners showing improved decision confidence. The method enables uncertainty propagation from data to practical recommendations, potentially enhancing the relevance and uptake of SE research in industry.

Abstract

A key goal of empirical research in software engineering is to assess practical significance, which answers whether the observed effects of some compared treatments show a relevant difference in practice in realistic scenarios. Even though plenty of standard techniques exist to assess statistical significance, connecting it to practical significance is neither straightforward nor routinely done; indeed, only a few empirical studies in software engineering assess practical significance in a principled and systematic way. In this paper, we argue that Bayesian data analysis provides suitable tools to assess practical significance rigorously. We demonstrate our claims in a case study comparing different test techniques. The case study's data was previously analyzed (Afzal et al., 2015) using standard techniques focusing on statistical significance. Here, we build a multilevel model of the same data, which we fit and validate using Bayesian techniques. Our method is to apply cumulative prospect theory on top of the statistical model to quantitatively connect our statistical analysis output to a practically meaningful context. This is then the basis both for assessing and arguing for practical significance. Our study demonstrates that Bayesian analysis provides a technically rigorous yet practical framework for empirical software engineering. A substantial side effect is that any uncertainty in the underlying data will be propagated through the statistical model, and its effects on practical significance are made clear. In combination with cumulative prospect theory, Bayesian analysis thus supports seamlessly assessing practical significance in an empirical software engineering context, potentially clarifying and extending the relevance of research for practitioners.
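To make the idea of applying cumulative prospect theory on top of a statistical model more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard Tversky and Kahneman (1992) functional forms and their median parameter estimates; the paper may use different forms or values. The function names (`value`, `weight`, `cpt_utility`) and the example draws are hypothetical, and outcomes are assumed to be monetary gains or losses relative to a reference point, supplied as equally likely draws (for instance, posterior predictive samples passed through a cost model).

```python
from collections import Counter

# Assumed parameters: median estimates from Tversky & Kahneman (1992).
ALPHA = 0.88       # curvature of the value function for gains
BETA = 0.88        # curvature of the value function for losses
LAMBDA = 2.25      # loss aversion coefficient
GAMMA_GAIN = 0.61  # probability weighting exponent for gains
GAMMA_LOSS = 0.69  # probability weighting exponent for losses


def value(x):
    """Subjective value of an outcome x relative to the reference point 0."""
    if x >= 0:
        return x ** ALPHA
    return -LAMBDA * (-x) ** BETA


def weight(p, gamma):
    """Inverse-S-shaped probability weighting function."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    num = p ** gamma
    return num / (num + (1.0 - p) ** gamma) ** (1.0 / gamma)


def cpt_utility(outcomes):
    """CPT utility of a prospect given as a list of equally likely outcomes."""
    n = len(outcomes)
    counts = Counter(outcomes)
    gains = sorted((x for x in counts if x >= 0), reverse=True)
    losses = sorted(x for x in counts if x < 0)

    total = 0.0
    cum = 0.0
    for x in gains:  # rank-dependent weights, starting from the best gain
        p = counts[x] / n
        total += value(x) * (weight(cum + p, GAMMA_GAIN) - weight(cum, GAMMA_GAIN))
        cum += p

    cum = 0.0
    for x in losses:  # rank-dependent weights, starting from the worst loss
        p = counts[x] / n
        total += value(x) * (weight(cum + p, GAMMA_LOSS) - weight(cum, GAMMA_LOSS))
        cum += p
    return total


if __name__ == "__main__":
    # Hypothetical net outcomes (e.g. cost savings) for two options.
    option_a = [120.0, 80.0, -40.0, 60.0, -10.0]
    option_b = [30.0, 25.0, 20.0, 35.0, 15.0]
    print("utility A:", cpt_utility(option_a))
    print("utility B:", cpt_utility(option_b))
```

Because the utility is computed over draws rather than a single point estimate, uncertainty in the draws carries through to the comparison between options, which is the property the abstract highlights.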

Paper Structure

This paper contains 25 sections, 3 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Assessing practical significance using a combination of Bayesian analysis and cumulative prospect theory.
  • Figure 2: An illustration of our approach using BDA and CPT applied to the decision between bigger or smaller test suites for cost-effective testing. The GLM enables the prediction of failures based on the test suite sizes via a Binomial likelihood. The expert (senior test manager) provides the cost model of running a test suite (e.g., including costs for both size and fault fixing) that is then framed as choices between test suite sizes with corresponding utility values. Practitioners then choose the outcome with better prospect utility.
  • Figure 3: Posterior marginal probability distributions of $\beta_e$ (experience, top) and $\beta_a$ (approach, bottom, labeled 'Technique'). The thick lines mark the medians, and the yellow areas cover 94% of the probability mass.
  • Figure 4: Left: expected number of detected faults with 94% probability for developers using exploratory testing and test-case based testing. Right: expected number of detected faults with 94% probability for developers with low experience and high experience.
  • Figure 5: Utility for different choices in scenario approach: the manager chooses whether developers use exploratory testing (ET) or test-case based testing (TCT).
  • ...and 4 more figures