Evaluating Search-Based Software Microbenchmark Prioritization
Christoph Laaber, Tao Yue, Shaukat Ali
TL;DR
This study addresses the problem of detecting performance regressions sooner by prioritizing software microbenchmarks, comparing search-based prioritization against greedy baselines. It introduces three proxy objectives, Coverage (C), Coverage Overlap (CO), and historical Change (CH), and evaluates both single- and multi-objective formulations using a large real-world Java dataset and rigorous statistical analysis. The results show that a simple greedy approach based solely on historical performance changes (GreedyCH) is often as effective as or more effective than the more complex coverage-based search methods, with substantially lower overhead; the best multi-objective method (C-CO-CH) is only competitive with the strongest greedy baselines. These findings suggest that non-coverage-based techniques, particularly those exploiting historical change data, are more practical for microbenchmarks. The work provides practical guidance for practitioners and opens avenues for future research into alternative proxies for performance changes beyond code coverage.
Abstract
Ensuring that software performance does not degrade after a code change is paramount. A solution is to regularly execute software microbenchmarks, a performance testing technique similar to (functional) unit tests; however, doing so often becomes infeasible due to extensive runtimes. To address that challenge, research has investigated regression testing techniques, such as test case prioritization (TCP), which reorder the execution within a microbenchmark suite to detect larger performance changes sooner. Such techniques are either designed for unit tests and perform sub-par on microbenchmarks or require complex performance models, drastically reducing their potential application. In this paper, we empirically evaluate single- and multi-objective search-based microbenchmark prioritization techniques to understand whether they are more effective and efficient than greedy, coverage-based techniques. For this, we devise three search objectives, i.e., coverage to maximize, coverage overlap to minimize, and historical performance change detection to maximize. We find that search algorithms (SAs) are competitive with, but do not outperform, the best greedy, coverage-based baselines. However, a simple greedy technique utilizing solely the performance change history (without coverage information) is equally or more effective than the best coverage-based techniques while being considerably more efficient, with a runtime overhead of less than 1%. These results show that simple, non-coverage-based techniques are a better fit for microbenchmarks than complex coverage-based techniques.
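To make the history-based idea concrete, the following minimal sketch ranks benchmarks by the size of the performance changes they detected in earlier versions. It illustrates the general principle only and is not the paper's implementation: the function name `prioritize_by_history`, the benchmark names, the change values, and the choice of the mean absolute change as the score are all assumptions made for this example.

```python
from statistics import mean


def prioritize_by_history(history):
    """Order benchmarks by their mean absolute historical change, largest first.

    `history` maps a benchmark name to the relative performance changes it
    detected in earlier versions (hypothetical data, e.g. 0.25 == 25%).
    Using the mean absolute change as the score is an assumption of this
    sketch; other aggregations are equally plausible.
    """
    scores = {
        name: (mean(abs(change) for change in changes) if changes else 0.0)
        for name, changes in history.items()
    }
    # Sort by descending score; break ties by name for a deterministic order.
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical change history for four benchmarks.
history = {
    "benchParse": [0.02, 0.01, 0.03],
    "benchSerialize": [0.40, 0.10],
    "benchHash": [],  # no recorded history: scored 0.0, so ranked last
    "benchSort": [0.05, 0.25, 0.15],
}

order = prioritize_by_history(history)
print(order)
```

Because a ranking of this kind needs only a lookup and a sort, and no coverage collection, it is plausible that such a technique adds very little overhead, in line with the efficiency result reported above.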
