Ever-Improving Test Suite by Leveraging Large Language Models
Ketai Qiu
TL;DR
The paper tackles the challenge of keeping software test suites aligned with real-world production usage by proposing E-Test, an LLM-assisted pipeline that continuously augments a test suite with production scenarios that have not yet been tested. E-Test instruments unit methods, uses a PreProcessor-Analyzer-PostProcessor workflow to classify observed scenarios as already-tested, not-yet-tested, or error-prone, and generates tests for the not-yet-tested cases. In the reported evaluation, E-Test achieves higher precision, recall, and F1 than state-of-the-art approaches such as FAST++ and field-ready testing, and the case studies indicate that not-yet-tested scenarios can reveal a majority of failures. The contributions include a novel methodology for production-aware test-suite expansion, a 1,975-scenario dataset, extensive experiments across multiple LLM configurations with a replication package, and practical implications for improving long-term software reliability without excessive test-suite bloat.
Abstract
Augmenting test suites with test cases that reflect the actual usage of the software system is extremely important to sustain the quality of long-lasting software systems. In this paper, we propose E-Test, an approach that incrementally augments a test suite with test cases that exercise behaviors that emerge in production and that have not been tested yet. E-Test leverages Large Language Models to identify already-tested, not-yet-tested, and error-prone unit execution scenarios, and augments the test suite accordingly. Our experimental evaluation shows that E-Test outperforms the main state-of-the-art approaches in identifying inadequately tested behaviors and optimizing test suites.
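The three-way classification described above can be illustrated with a minimal Python sketch. It is not the E-Test implementation: the names `Scenario`, `pre_process`, `post_process`, `select_scenarios_to_test`, and `stub_analyzer` are hypothetical, the query format is an assumption, and the analyzer is a plain function standing in for an LLM call so that the example runs without any model.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    ALREADY_TESTED = "already-tested"
    NOT_YET_TESTED = "not-yet-tested"
    ERROR_PRONE = "error-prone"


@dataclass(frozen=True)
class Scenario:
    method: str    # name of the instrumented unit method (hypothetical field)
    inputs: tuple  # argument values observed in production (hypothetical field)


def pre_process(scenario, tested_inputs):
    """Build the analyzer query; this dictionary format is an assumption."""
    return {
        "method": scenario.method,
        "inputs": scenario.inputs,
        "tested": tested_inputs,
    }


def post_process(raw_answer):
    """Map a free-text answer to a Label.

    Assumption: unparseable answers fall back to ALREADY_TESTED,
    so they never trigger test generation.
    """
    text = raw_answer.strip().lower()
    for label in Label:
        if label.value in text:
            return label
    return Label.ALREADY_TESTED


def select_scenarios_to_test(scenarios, tested_inputs, analyzer):
    """Return the scenarios labeled not-yet-tested by the analyzer."""
    selected = []
    for scenario in scenarios:
        answer = analyzer(pre_process(scenario, tested_inputs))
        if post_process(answer) is Label.NOT_YET_TESTED:
            selected.append(scenario)
    return selected


def stub_analyzer(query):
    """Stand-in for an LLM: flags any input tuple absent from the tested set."""
    if query["inputs"] in query["tested"]:
        return "already-tested"
    return "Not-yet-tested"


observed = [Scenario("divide", (4, 2)), Scenario("divide", (1, 0))]
selected = select_scenarios_to_test(observed, {(4, 2)}, stub_analyzer)
```

In this sketch only `divide(1, 0)` is selected, because `(4, 2)` is already in the tested set. Replacing `stub_analyzer` with a function that queries a model would leave the surrounding steps unchanged, which is the main reason for keeping the three stages separate here.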
