On Rank Aggregating Test Prioritizations
Shouvick Mondal, Tse-Hsun Chen
TL;DR
The paper addresses the challenge of robust test-case prioritization by introducing Ensemble Test Prioritization (EnTP), a three-stage pipeline that combines diversity-based ensemble selection with social-choice-based rank aggregation to derive consensus prioritizations for system-level regression tests. It leverages 16 standalone heuristics to form a 25-permutation ensemble, uses Kendall-tau distance to select diverse subsets, and aggregates them via Kemeny-Young, Borda-count, or mean/median methods to schedule tests. Empirical evaluation on 20 open-source C projects (694,512 SLOC, 280 versions, 69,305 test cases) shows that EnTP with a top-75% diversity budget often outperforms standalone heuristics and state-of-the-art approaches, particularly in cost-aware metrics like $APFD_c$ under highly imbalanced test costs. The work demonstrates the practical value of consensus-based TCP and provides public artifacts to support replication, with future directions including broader benchmarks, CI integration, and deeper exploration of domain-aware diversity.
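To make the two building blocks named above concrete, the following is a minimal Python sketch of Kendall-tau distance between two prioritizations and of Borda-count aggregation. It is an illustration under stated assumptions, not the paper's implementation: the function names, the test identifiers, and the tie-breaking rule (alphabetical by test id) are all hypothetical.

```python
from itertools import combinations


def kendall_tau_distance(order_a, order_b):
    """Count the test pairs that the two orderings rank in opposite relative order."""
    pos_a = {test: i for i, test in enumerate(order_a)}
    pos_b = {test: i for i, test in enumerate(order_b)}
    return sum(
        1
        for x, y in combinations(order_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def borda_aggregate(orderings):
    """Borda count: a test at position p in a list of n tests earns n - 1 - p points."""
    n = len(orderings[0])
    scores = {test: 0 for test in orderings[0]}
    for order in orderings:
        for position, test in enumerate(order):
            scores[test] += n - 1 - position
    # Highest total score first; ties broken by test id (an assumption of this sketch).
    return sorted(scores, key=lambda test: (-scores[test], test))


# Three hypothetical standalone prioritizations of the same three tests.
plans = [
    ["t1", "t2", "t3"],
    ["t2", "t1", "t3"],
    ["t1", "t3", "t2"],
]
print(kendall_tau_distance(plans[0], plans[1]))  # 1: only the pair (t1, t2) is flipped
print(borda_aggregate(plans))                    # ['t1', 't2', 't3']
```

The distance is the quadratic-time pairwise count, which is adequate for illustration. Kemeny-Young aggregation, which searches for the ordering that minimizes the total Kendall-tau distance to all input orderings, is considerably more expensive and is not sketched here.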
Abstract
Test case prioritization (TCP) has been an effective strategy to optimize regression testing. Traditionally, test cases are ordered based on some heuristic and rerun against the version under test with the goal of yielding a high failure throughput. Almost four decades of TCP research have seen extensive contributions in the form of individual prioritization strategies. However, test case prioritization via preference aggregation has largely been unexplored. We envision this methodology as an opportunity to obtain robust prioritizations by consolidating multiple standalone ranked lists, i.e., by reaching a consensus. In this work, we propose Ensemble Test Prioritization (EnTP) as a three-stage pipeline involving: (i) ensemble selection, (ii) rank aggregation, and (iii) test case execution. We evaluate EnTP on 20 open-source C projects from the Software-artifact Infrastructure Repository and GitHub (totaling 694,512 SLOC, 280 versions, and 69,305 system-level test cases). We employ an ensemble of 16 standalone prioritization plans, four of which are imposed by their respective state-of-the-art approaches. We build EnTP on the foundations of Hansie, an existing framework for consensus prioritization, and show that EnTP's diversity-based ensemble selection budget of top-75% followed by rank aggregation can outperform both Hansie and the employed standalone prioritization approaches.
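The three stages listed in the abstract can be sketched end to end as follows. This is a hedged illustration only: the abstract does not specify the selection criterion, so the sketch assumes that a plan's diversity is its mean pairwise-disagreement count against the other plans and that the top 75% by that score are kept. The aggregation step uses mean rank, and `run_test` is a hypothetical callable standing in for real test execution. All function names are invented for this example.

```python
from itertools import combinations
from math import ceil
from statistics import mean


def pair_disagreements(order_a, order_b):
    """Number of test pairs ordered differently by the two plans."""
    pos_a = {t: i for i, t in enumerate(order_a)}
    pos_b = {t: i for i, t in enumerate(order_b)}
    return sum(
        1
        for x, y in combinations(order_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def select_diverse(ensemble, budget=0.75):
    """Stage (i): keep the `budget` fraction of plans that disagree most with the rest.

    Assumed criterion (not taken from the paper); requires at least two plans.
    """
    keep = max(1, ceil(budget * len(ensemble)))

    def diversity(i):
        return mean(
            pair_disagreements(ensemble[i], other)
            for j, other in enumerate(ensemble)
            if j != i
        )

    # Most diverse first; ties broken by original index for determinism.
    ranked = sorted(range(len(ensemble)), key=lambda i: (-diversity(i), i))
    return [ensemble[i] for i in sorted(ranked[:keep])]


def mean_rank_aggregate(plans):
    """Stage (ii): order tests by their average position across the selected plans."""
    avg = {t: mean(p.index(t) for p in plans) for t in plans[0]}
    return sorted(plans[0], key=lambda t: (avg[t], t))


def entp_sketch(ensemble, run_test, budget=0.75):
    """Stage (iii): execute tests in consensus order and collect (test, verdict) pairs."""
    consensus = mean_rank_aggregate(select_diverse(ensemble, budget))
    return [(t, run_test(t)) for t in consensus]


# Four hypothetical plans over four tests; the test named "d" is assumed to fail.
ensemble = [
    ["a", "b", "c", "d"],
    ["a", "b", "d", "c"],
    ["d", "c", "b", "a"],
    ["b", "a", "c", "d"],
]
results = entp_sketch(ensemble, run_test=lambda t: t != "d")
```

With four plans and a 0.75 budget, three plans survive selection, and the reversed plan is always among them because it disagrees most with the others.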
