Which Combination of Test Metrics Can Predict Success of a Software Project? A Case Study in a Year-Long Project Course

Marina Filipovic; Fabian Gilson

TL;DR

The paper investigates how different testing metrics relate to external software quality, defined as functional suitability per ISO/IEC 25010. Using a year-long Scrum-based course with eight teams and six sprints, it combines code-coverage, automated and manual acceptance testing, and time-effort data, analyzed through Linear Mixed-Effects Models. A key finding is that the average testing coverage across unit, automated acceptance, and manual testing (AVGTC) strongly predicts passed story points (PSP), supporting the testing-pyramid concept; front-end unit testing (FEUT) and manual acceptance testing (MAT) also show significant individual effects, while back-end unit testing (BEUT) may negatively relate to PSP. Time-effort data were largely unreliable due to logging issues, and no robust link was found between testing effort and subsequent fixing or deferral behaviors. The work highlights practical implications for teaching and practicing more balanced testing strategies in CI-enabled, team-based development, while acknowledging several methodological limitations and directions for improvement in data collection and ATDD adoption.

Abstract

Testing plays an important role in securing the success of a software development project. Prior studies have demonstrated beneficial effects of applying acceptance testing within a Behaviour-Driven Development method. In this research, we investigate whether we can quantify the effects that various types of testing have on functional suitability, i.e. the software's conformance to users' functional expectations. We explore which combination of software testing (automated and manual, including acceptance testing) should be applied to ensure the expected functional requirements are met, as well as whether a lack of testing during a development iteration causes a significant increase in the effort spent fixing the project later on. To answer those questions, we collected and analysed data from a year-long software engineering project course. We combined manual observations and statistical methods, namely Linear Mixed-Effects Modelling, to evaluate the effects of coverage metrics as well as time effort on passed stories over 5 Scrum sprints. The results suggest that high coverage across all of automated unit, acceptance, and manual testing combined has a significant impact on functional suitability. Similarly, but to a lesser extent, front-end unit testing and manual testing can predict the success of a software delivery when taken independently. We observed a close-to-significant relationship between low back-end testing and deferral (i.e. postponement) of user stories.

Paper Structure (83 sections, 3 figures, 12 tables)


Introduction
Related Work
Secondary Studies on Testing Practices
Test- and Behaviour-Driven Development
In academia
Our contribution
In the industry
Our contribution
Testing Effort And External Quality
Our contribution
Continuous Integration and Testing
Our contribution
Method
Research Questions
Study Design
...and 68 more sections

Figures (3)

Figure 1: Evolution of testing coverage metrics and accepted stories for Team A.
Figure 2: Evolution of testing coverage metrics and accepted stories for Team H.
Figure 3: Effect of Total Testing Effort (TTE) on next sprint's fixing effort (Fix.next), per sprint.
