Empirical Derivations from an Evolving Test Suite

Jukka Ruohonen, Abhishek Tiwari

TL;DR

This longitudinal study analyzes NetBSD's automated, virtualization-based test suite from 2011 to 2025 to understand its growth, failure dynamics, and potential drivers of failures. Using monthly and architecture-specific data, the authors apply Kruskal-Wallis tests and regression analyses (OLS, LR, NB checks) to relate test outcomes to code churn (commit counts) and kernel modifications (sys-root commits). They find continuous growth of the test suite to over 11,000 tests, generally stable failure levels with periodic bursts, and only small, inconsistent effects from churn and kernel changes, although kernel changes can modestly raise test failures in some periods. The work highlights the value and limitations of extracting empirical insights from large, evolving test suites in OS software and suggests directions for deeper, period-specific investigations.

Abstract

The paper presents a longitudinal empirical analysis of the automated, continuous, and virtualization-based software test suite of the NetBSD operating system. The observation period spans from the initial rollout of the test suite in the early 2010s to late 2025. According to the results, the test suite has grown continuously and currently covers over ten thousand individual test cases. The number of failed test cases has remained stable overall, although there have been shorter periods marked by more frequent failures. A similar observation applies to build failures, failures of the test suite to complete, and installation failures, all of which are also captured by NetBSD's testing framework. Finally, code churn and kernel modifications do not provide longitudinally consistent statistical explanations for the failures. Although some periods exhibit larger effects, particularly with respect to kernel modifications, the effects are small on average. While only exploratory, these empirical observations contribute to efforts to draw conclusions from large-scale and evolving software test suites.

Paper Structure

This paper contains 14 sections, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Monthly Sample Sizes
  • Figure 2: Test Cases and Failed Test Cases (monthly averages)
  • Figure 3: Build, Test Suite, and Installation Failures (monthly sums)
  • Figure 4: Regression Results for the amd64 Architecture (separate monthly estimates; coefficients or marginal effects of the coefficients)
  • Figure 5: Regression Results for the i386 Architecture (separate monthly estimates; coefficients or marginal effects of the coefficients)
  • ...and 1 more figure