Which Combination of Test Metrics Can Predict Success of a Software Project? A Case Study in a Year-Long Project Course
Marina Filipovic, Fabian Gilson
TL;DR
The paper investigates how different testing metrics relate to external software quality, defined as functional suitability per ISO/IEC 25010. Using a year-long Scrum-based course with eight teams and six sprints, it combines code-coverage, automated and manual acceptance testing, and time-effort data, analyzed through Linear Mixed-Effects Models. A key finding is that the average testing coverage across unit, automated acceptance, and manual testing (AVGTC) strongly predicts passed story points (PSP), supporting the testing-pyramid concept; front-end unit testing (FEUT) and manual acceptance testing (MAT) also show significant individual effects, while back-end unit testing (BEUT) may be negatively related to PSP. Time-effort data were largely unreliable due to logging issues, and no robust link was found between testing effort and subsequent fixing or deferral behaviors. The work highlights practical implications for teaching and practicing more balanced testing strategies in CI-enabled, team-based development, while acknowledging several methodological limitations and directions for improvement in data collection and ATDD adoption.
Abstract
Testing plays an important role in securing the success of a software development project. Prior studies have demonstrated beneficial effects of applying acceptance testing within a Behaviour-Driven Development method. In this research, we investigate whether we can quantify the effects that various types of testing have on functional suitability, i.e. the software's conformance to users' functional expectations. We explore which combination of software testing (automated and manual, including acceptance testing) should be applied to ensure the expected functional requirements are met, as well as whether a lack of testing during a development iteration causes a significant increase in the effort spent fixing the project later on. To answer those questions, we collected and analysed data from a year-long software engineering project course. We combined manual observations and statistical methods, namely Linear Mixed-Effects Modelling, to evaluate the effects of coverage metrics as well as time effort on passed stories over 5 Scrum sprints. The results suggest that high coverage across automated unit, acceptance, and manual testing combined has a significant impact on functional suitability. Similarly, but to a lesser extent, front-end unit testing and manual testing can predict the success of a software delivery when taken independently. We observed a close-to-significant relationship between low back-end testing and deferral (i.e. postponement) of user stories.
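To make the combined-coverage idea concrete, the following is a minimal sketch and not the authors' analysis. It assumes that the combined metric is a plain mean of three coverage ratios, and it swaps the Linear Mixed-Effects Model for a much simpler within-team (group-centred) regression that only removes each team's baseline level. The function names, the record layout, and the numbers in the usage lines are hypothetical.

```python
from statistics import mean


def avgtc(unit_cov, acceptance_cov, manual_cov):
    """Average testing coverage over the three testing types (assumed to be a plain mean)."""
    return mean([unit_cov, acceptance_cov, manual_cov])


def within_team_slope(rows):
    """Slope of passed story points on average coverage, after centring on team means.

    Each row is (team, sprint, unit_cov, acceptance_cov, manual_cov, passed_story_points).
    Centring per team removes baseline differences between teams, which is roughly
    the role of a per-team random intercept. This is NOT a full mixed-effects fit.
    """
    by_team = {}
    for team, _sprint, unit, acceptance, manual, psp in rows:
        by_team.setdefault(team, []).append((avgtc(unit, acceptance, manual), psp))

    numerator = 0.0
    denominator = 0.0
    for observations in by_team.values():
        x_bar = mean(x for x, _ in observations)
        y_bar = mean(y for _, y in observations)
        for x, y in observations:
            numerator += (x - x_bar) * (y - y_bar)
            denominator += (x - x_bar) ** 2

    if denominator == 0:
        raise ValueError("coverage does not vary within any team")
    return numerator / denominator


# Hypothetical, made-up records for two teams over two sprints (not study data).
example_rows = [
    ("A", 1, 0.30, 0.30, 0.30, 22.0),
    ("A", 2, 0.60, 0.60, 0.60, 34.0),
    ("B", 1, 0.20, 0.20, 0.20, 38.0),
    ("B", 2, 0.50, 0.50, 0.50, 50.0),
]
slope = within_team_slope(example_rows)
```

A positive slope here would mean that sprints with higher average coverage delivered more passed story points within the same team. A real analysis would fit an actual mixed-effects model and report significance, which this sketch does not attempt.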
