SAT-sampling for statistical significance testing in sparse contingency tables
Patrick Scharpfenecker, Tobias Windisch
TL;DR
This work tackles exact conditional inference for contingency tables by sampling from the fiber $\mathcal{F}_{A,b}$ of tables with fixed margins $b = A u^{obs}$, a setting in which computing a full Markov basis is often infeasible for sparse or structurally constrained tables. It introduces a SAT-based encoding of the fiber via Boolean circuits and Tseitin transformations, which enables (almost) uniform sampling with SAT samplers; these samples then serve as proposals within hybrid Metropolis-Hastings schemes that preserve the correct stationary distribution. The authors propose two hybrid strategies, $\mathrm{A}_{n}({\mathcal{M}})$ and $\mathrm{P}_{n, k}({\mathcal{M}})$, that combine SAT proposals with local Markov moves and mitigate SAT-induced bias. Across benchmarks on the $I$, $QI$, and $N3F$ models, including highly constrained cases, the SAT-based methods deliver reliable conditional $p$-values and often outperform samplers based on precomputed Markov bases, offering a scalable alternative when full Markov bases are infeasible. The approach reduces dependence on Markov bases and enables robust conditional testing in sparse contingency tables with many structural zeros.
Abstract
Exact conditional tests for contingency tables require sampling from fibers with fixed margins. Classical Markov-basis MCMC is general but often impractical: computing a full Markov basis that connects all fibers of a given constraint matrix can be infeasible, and the resulting chains may converge slowly, especially in sparse settings or in the presence of structural zeros. We introduce a SAT-based alternative that encodes fibers as Boolean circuits, which allows modern SAT samplers to generate tables at random. We analyze the sampling bias that SAT samplers may introduce, provide diagnostics, and propose practical mitigations. We further propose hybrid MCMC schemes that combine SAT proposals with local moves to ensure the correct stationary distribution. These schemes do not necessarily require the fiber to be connected by local moves alone, which is particularly beneficial in the presence of structural zeros. Across benchmarks, including small and involved tables with many structural zeros where pure Markov-basis methods underperform, our methods deliver reliable conditional p-values and often outperform samplers that rely on precomputed Markov bases.
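To make the notion of a fiber and of a conditional p-value concrete, the following minimal Python sketch enumerates every table with the same row and column sums as a small observed table and computes the exact conditional p-value under the independence model. It is a brute-force illustration only, not the SAT-based method of the paper: full enumeration is feasible only for tiny tables, which is exactly why sampling is needed in general. The function names (`fiber`, `weight`, `chi2`, `exact_p_value`) and the choice of the Pearson chi-square statistic are assumptions made for this example.

```python
from fractions import Fraction
from math import factorial


def fiber(row_sums, col_sums):
    """Yield every non-negative integer table with the given margins."""

    def rows(total, caps):
        # All vectors v with 0 <= v[j] <= caps[j] and sum(v) == total.
        if len(caps) == 1:
            if total <= caps[0]:
                yield (total,)
            return
        for x in range(min(total, caps[0]) + 1):
            for rest in rows(total - x, caps[1:]):
                yield (x,) + rest

    def rec(i, remaining):
        # Fill row i; `remaining` holds what is left of each column sum.
        if i == len(row_sums):
            if all(c == 0 for c in remaining):
                yield ()
            return
        for row in rows(row_sums[i], remaining):
            new_remaining = tuple(c - x for c, x in zip(remaining, row))
            for tail in rec(i + 1, new_remaining):
                yield (row,) + tail

    yield from rec(0, tuple(col_sums))


def weight(table):
    """Unnormalized conditional probability under independence: 1 / prod(u_ij!)."""
    w = Fraction(1)
    for row in table:
        for x in row:
            w /= factorial(x)
    return w


def chi2(table):
    """Pearson chi-square statistic (assumes all margins are positive)."""
    r = [sum(row) for row in table]
    c = [sum(col) for col in zip(*table)]
    n = sum(r)
    return sum(
        Fraction((n * table[i][j] - r[i] * c[j]) ** 2, n * r[i] * c[j])
        for i in range(len(r))
        for j in range(len(c))
    )


def exact_p_value(observed):
    """Probability mass of fiber tables at least as extreme as `observed`."""
    r = [sum(row) for row in observed]
    c = [sum(col) for col in zip(*observed)]
    tables = list(fiber(r, c))
    total = sum(weight(t) for t in tables)
    t_obs = chi2(observed)
    extreme = sum(weight(t) for t in tables if chi2(t) >= t_obs)
    return extreme / total


observed = ((3, 1), (1, 3))
print(len(list(fiber((4, 4), (4, 4)))))  # number of tables in the fiber
print(exact_p_value(observed))           # exact conditional p-value
```

For the 2x2 table above the fiber contains only five tables, so the p-value can be checked by hand. In a sampling-based test, the sum over the whole fiber is replaced by an average over tables drawn from the conditional distribution.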
