Testing Uniform Random Samplers: Methods, Datasets and Protocols
Olivier Zeyen, Maxime Cordy, Martin Gubri, Gilles Perrouin, Mathieu Acher
TL;DR
This work tackles the challenge of assessing uniformity in uniform random samplers (URS) for Boolean formulae, a key enabler for unbiased testing of highly configurable software systems. It introduces a practical testing framework with five statistical tests (Pearson's χ^2 goodness-of-fit (GOF), monobit, variable frequency (VF), selected features per configuration (SFpC), and a birthday-paradox-based test) and methods to combine results across multiple formulae using Bonferroni and harmonic mean p-values. Through an extensive empirical study on real-world feature models, industrial formulae, and synthetic benchmarks, the authors show that most URS tools fail multiple tests, with UniGen3 emerging as the most uniform among those studied; they also show that the set of input formulae significantly affects test outcomes. The paper emphasizes dataset diversity and multi-test consensus as the basis for reliably assessing uniformity, and cautions against over-reliance on a single benchmark or test. Practically, the framework supports researchers in comparing URS candidates, catching implementation bugs, and guiding the development of scalable, provably uniform samplers. The authors also provide open-source data and tooling to enable reproducible, community-driven benchmarking.
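As a rough illustration of how per-formula p-values might be combined into a single verdict, the sketch below implements a Bonferroni-adjusted minimum p-value and a raw, equal-weight harmonic mean p-value. The function names are hypothetical and are not taken from the authors' tooling. The harmonic mean shown here is uncalibrated, and the exact procedure used in the paper may differ.

```python
def bonferroni_p(p_values):
    """Bonferroni-combined p-value: the smallest p-value scaled by the
    number of tests, capped at 1."""
    if not p_values:
        raise ValueError("need at least one p-value")
    return min(1.0, len(p_values) * min(p_values))


def harmonic_mean_p(p_values):
    """Raw harmonic mean of the p-values with equal weights.
    Assumption: no calibration step is applied to the result."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 for p in p_values):
        return 0.0  # a zero p-value dominates the harmonic mean
    return len(p_values) / sum(1.0 / p for p in p_values)


if __name__ == "__main__":
    # Hypothetical p-values, one per input formula, for a single sampler.
    per_formula = [0.40, 0.03, 0.75, 0.20]
    print("Bonferroni:", bonferroni_p(per_formula))
    print("Harmonic mean:", harmonic_mean_p(per_formula))
```

Bonferroni reacts only to the single smallest p-value, whereas the harmonic mean is pulled down by several moderately small values at once, which is one reason to report both.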
Abstract
Boolean formulae compactly encode huge, constrained search spaces. Thus, variability-intensive systems are often encoded with Boolean formulae. The search space of a variability-intensive system is usually too large to explore without statistical inference (e.g. testing). Testing every valid configuration is computationally expensive (if not impossible) for most systems. This leads most testing approaches to sample a few configurations before analyzing them. A desirable property of such samples is uniformity: each solution should have the same selection probability. Uniformity is the property that facilitates statistical inference. This property motivated the design of uniform random samplers, which rely on SAT solvers and counters and achieve different trade-offs between uniformity and scalability. Although we can observe their performance in practice, judging the quality of the generated samples is a different matter. Assessing the uniformity of a sampler is similar in nature to assessing the uniformity of a pseudo-random number generator (PRNG). However, sampling is much slower, and the hyperspace containing the samples is constrained by the formula. As a result, testing PRNGs is subject to fewer constraints than testing samplers. We propose a framework of five statistical tests suited to testing uniform random samplers. Moreover, we demonstrate their use by testing seven samplers. Finally, we demonstrate how the Boolean formula given as input to the samplers under test influences the test results.
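To make the idea of testing a sampler for uniformity concrete, here is a minimal sketch of a Pearson χ^2 goodness-of-fit check on a toy formula. It rests on several assumptions that are not from the paper: solutions are enumerated by brute force (feasible only for tiny formulae), clauses use DIMACS-style integer literals, the p-value comes from the Wilson-Hilferty normal approximation rather than the exact χ^2 tail so that only the standard library is needed, and all function names are hypothetical.

```python
import itertools
import math
import random
from collections import Counter
from statistics import NormalDist


def solutions(clauses, n_vars):
    """Enumerate all satisfying assignments of a small CNF by brute force.
    Literal k means variable k is true, -k means it is false."""
    sols = []
    for bits in itertools.product([False, True], repeat=n_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            sols.append(bits)
    return sols


def chi2_uniformity(samples, sols):
    """Pearson chi-squared statistic of the samples against the uniform
    distribution over sols, with an approximate p-value."""
    if len(sols) < 2:
        raise ValueError("need at least two solutions")
    counts = Counter(samples)
    expected = len(samples) / len(sols)
    stat = sum((counts[s] - expected) ** 2 / expected for s in sols)
    dof = len(sols) - 1
    # Wilson-Hilferty: (stat / dof) ** (1/3) is approximately normal
    # with mean 1 - 2/(9*dof) and variance 2/(9*dof).
    var = 2.0 / (9.0 * dof)
    z = ((stat / dof) ** (1.0 / 3.0) - (1.0 - var)) / math.sqrt(var)
    p_value = 1.0 - NormalDist().cdf(z)
    return stat, dof, p_value


if __name__ == "__main__":
    # Toy formula: (x1 or x2) and (not x1 or x3)
    sols = solutions([[1, 2], [-1, 3]], n_vars=3)
    rng = random.Random(0)
    # Stand-in for a sampler under test: draws uniformly from the solutions.
    samples = [rng.choice(sols) for _ in range(1000)]
    print(chi2_uniformity(samples, sols))
```

A small p-value suggests the sampler deviates from uniformity on this formula. Because the expected count per solution must stay reasonably large, this kind of test only applies when the number of solutions is small relative to the number of samples that can be drawn.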
