Unit Testing Past vs. Present: Examining LLMs' Impact on Defect Detection and Efficiency
Rudolf Ramler, Philipp Straubinger, Reinhold Plösch, Dietmar Winkler
TL;DR
This study evaluates whether large language models (LLMs) enhance defect detection during unit testing by replicating a classic 60-minute, time-boxed testing experiment on a Java system under test (SUT) with seeded defects. It compares LLM-supported testing against manual testing, measuring the number of tests created, branch coverage, defects found, and false positives. The results show that LLM support markedly increases the number of tests written and defects detected and yields higher overall coverage, but the number of false positives also rises as test volume grows. The findings provide empirical evidence that LLMs can meaningfully improve unit testing efficiency and effectiveness, informing practice on how to integrate LLM support into software testing workflows.
Abstract
The integration of Large Language Models (LLMs), such as ChatGPT and GitHub Copilot, into software engineering workflows has shown potential to enhance productivity, particularly in software testing. This paper investigates whether LLM support improves defect detection effectiveness during unit testing. Building on prior studies comparing manual and tool-supported testing, we replicated and extended an experiment in which participants, supported by LLMs, wrote unit tests for a Java-based system with seeded defects within a time-boxed session. Comparing LLM-supported and manual testing, the results show that LLM support significantly increases the number of unit tests generated, defect detection rates, and overall testing efficiency. These findings highlight the potential of LLMs to improve testing and defect detection outcomes, providing empirical insights into their practical application in software testing.
