Crossover Designs in Software Engineering Experiments: Review of the State of Analysis

Julian Frattini, Davide Fucci, Sira Vegas

TL;DR

The paper investigates how the crossover-design guidelines introduced by Vegas et al. have influenced data analyses in software engineering (SE) experiments. Using forward snowballing of studies that cite the guidelines, it analyzes 136 publications describing 67 crossover experiments across 48 primary studies, extracting artifacts and assessing how each threat to validity is addressed. It finds that only about 29.5% of validity threats are properly addressed and that carryover effects are rarely modeled (~3%), although there are partial improvements such as period and sequence reporting. Data and script availability remain limited. The work highlights the need for stronger guideline adherence, broader methodological adoption (e.g., GLMMs), and better replication artifacts to improve the reliability and interpretability of SE crossover-design research.

Abstract

Experimentation is an essential method for causal inference in any empirical discipline. Crossover-design experiments are common in Software Engineering (SE) research. In these, subjects apply more than one treatment in different orders. This design increases the amount of obtained data and deals with subject variability but introduces threats to internal validity such as learning and carryover effects. Vegas et al. reviewed the state of practice for crossover designs in SE research and provided guidelines on how to address their threats during data analysis while still harnessing their benefits. In this paper, we reflect on the impact of these guidelines and review the state of analysis of crossover-design experiments in SE publications between 2015 and March 2024. To this end, by conducting a forward snowballing of the guidelines, we survey 136 publications reporting 67 crossover-design experiments and evaluate their data analysis against the provided guidelines. The results show that the validity of data analyses has improved compared to the original state of analysis. Still, despite the explicit guidelines, only 29.5% of all threats to validity were addressed properly. While the maturation and the optimal sequence threats are properly addressed in 35.8% and 38.8% of all studies in our sample, respectively, the carryover threat is modeled in only about 3% of the observed cases. The lack of adherence to the analysis guidelines threatens the validity of the conclusions drawn from crossover-design experiments.
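
To make the design described above concrete, the following is a minimal sketch of how observations in a crossover experiment could be laid out, assuming a two-treatment AB/BA design. The function name `crossover_layout`, the treatment labels, and the column names are hypothetical and chosen for illustration only; this is not the authors' analysis or data. Each row records the period, sequence, and preceding treatment, which is one plausible reading of the factors a later analysis would need in order to account for maturation, optimal sequence, and carryover.

```python
from itertools import permutations


def crossover_layout(subjects, treatments=("A", "B")):
    """Assign subjects round-robin to every treatment order.

    Returns one row (dict) per observation. All names are illustrative
    assumptions, not taken from the paper.
    """
    # Every possible order of the treatments, e.g. ("A", "B") and ("B", "A").
    sequences = list(permutations(treatments))
    rows = []
    for i, subject in enumerate(subjects):
        order = sequences[i % len(sequences)]
        for period, treatment in enumerate(order, start=1):
            # Carryover candidate: the treatment applied in the previous
            # period, or None in the first period.
            carryover = order[period - 2] if period > 1 else None
            rows.append({
                "subject": subject,
                "sequence": "".join(order),
                "period": period,
                "treatment": treatment,
                "carryover": carryover,
            })
    return rows


if __name__ == "__main__":
    for row in crossover_layout(["s1", "s2", "s3", "s4"]):
        print(row)
```

In a table like this, treatment varies within each subject, which is how the design deals with subject variability, while period and sequence are explicit columns that a statistical model can include as factors rather than leave confounded with the treatment effect.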

Paper Structure (17 sections, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Relevant Factors Influencing the Response Variable in a Crossover-Design Experiment
  • Figure 2: Types of subjects in the experiments
  • Figure 3: Number of subjects in the experiments
  • Figure 4: Applied statistical methods
  • Figure 5: Applied NHSTs
  • ...and 3 more figures