Ensemble Threshold Calibration for Stable Sensitivity Control
John N. Daras
TL;DR
This paper tackles exact recall calibration for large-scale geospatial entity matching under skewed score distributions. It introduces a TPU-friendly end-to-end pipeline that combines an equigrid, CSR-based candidate filter, a lightweight neural ranker applied in a single forward pass, and a deterministic xxHash-based decile sampler that produces a reproducible calibration set. Four threshold rules (Clopper–Pearson, Jeffreys, Wilson, and exact quantile) are fused via inverse-variance weighting and aggregated across nine independent subsamples to reduce recall variance; the final threshold is chosen as the minimum across subsamples. Empirical results on two cadastral datasets show that recall targets are met with sub-percent variability and modest calibration overhead, with end-to-end runtimes under 4 minutes on a single TPU v3 core and total end-to-end times under roughly 80 minutes for large-scale runs. The approach offers robust, scalable recall calibration with practical verification costs for large-scale spatial matching tasks.
Abstract
Precise recall control is critical in large-scale spatial conflation and entity-matching tasks, where missing even a few true matches can break downstream analytics, while excessive manual review inflates cost. Classical confidence-interval cuts such as Clopper-Pearson or Wilson provide lower bounds on recall, but they routinely overshoot the target by several percentage points and exhibit high run-to-run variance under skewed score distributions. We present an end-to-end framework that achieves exact recall with sub-percent variance over tens of millions of geometry pairs while remaining TPU-friendly. Our pipeline starts with an equigrid bounding-box filter and a compressed sparse row (CSR) candidate representation, reducing pair enumeration by two orders of magnitude. A deterministic xxHash bootstrap sample trains a lightweight neural ranker; its scores are propagated to all remaining pairs via a single forward pass and used to construct a reproducible, score-decile-stratified calibration set. Four complementary threshold estimators (Clopper-Pearson, Jeffreys, Wilson, and an exact quantile) are aggregated via inverse-variance weighting, then fused across nine independent subsamples. This ensemble reduces threshold variance relative to any single estimator. Evaluated on two real cadastral datasets (approximately 6.31M and 67.34M pairs), our approach consistently hits the recall target within a small error, requires fewer redundant verifications than other calibrations, and runs end-to-end on a single TPU v3 core.
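The inverse-variance fusion of the four threshold estimators can be illustrated with a minimal Python sketch. It assumes each estimator supplies a threshold together with a variance estimate (how those variances are obtained is not stated here, so they are taken as given), and that a pair is accepted when its score is at or above the threshold, so the minimum over subsamples is the recall-conservative choice. The function names `fuse_inverse_variance` and `ensemble_threshold` are hypothetical.

```python
import numpy as np


def fuse_inverse_variance(thresholds, variances, eps=1e-12):
    """Inverse-variance weighted mean of several threshold estimates.

    `variances` are assumed to be supplied by the caller; `eps` guards
    against division by zero for a degenerate (zero-variance) estimate.
    """
    t = np.asarray(thresholds, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / np.maximum(v, eps)          # more certain estimates weigh more
    return float(np.sum(w * t) / np.sum(w))


def ensemble_threshold(per_subsample_thresholds, per_subsample_variances):
    """Fuse the estimators within each subsample, then take the minimum.

    Returns (final_threshold, fused_threshold_per_subsample).
    """
    fused = [
        fuse_inverse_variance(t, v)
        for t, v in zip(per_subsample_thresholds, per_subsample_variances)
    ]
    return min(fused), fused


if __name__ == "__main__":
    # Two toy subsamples with four estimator outputs each (illustrative values).
    thresholds = [[0.31, 0.33, 0.32, 0.35], [0.29, 0.30, 0.30, 0.34]]
    variances = [[4e-4, 3e-4, 3e-4, 9e-4], [4e-4, 3e-4, 3e-4, 9e-4]]
    final, fused = ensemble_threshold(thresholds, variances)
    print(final, fused)
```

With equal variances the fusion reduces to a plain average; a noisier estimator (here the last one in each list) is pulled toward the others in proportion to its variance.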

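The reproducible, score-decile-stratified calibration set described in the abstract can be sketched in the same spirit. This is an illustration under stated assumptions, not the paper's implementation: the standard-library `hashlib.blake2b` stands in for xxHash so that the example has no third-party hashing dependency, pairs are identified by integer ids, and the names `hash_unit` and `decile_stratified_sample` as well as the `per_decile` parameter are hypothetical.

```python
import hashlib

import numpy as np


def hash_unit(pair_id: int, seed: int = 0) -> float:
    """Map a pair id to a reproducible sort key in the unit interval.

    blake2b is a stand-in for xxHash (an assumption of this sketch).
    """
    digest = hashlib.blake2b(f"{seed}:{pair_id}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


def decile_stratified_sample(pair_ids, scores, per_decile, seed=0):
    """Pick up to `per_decile` pairs from each score decile, deterministically.

    Selection depends only on (seed, pair id, score), never on input order
    or on a random number generator, so reruns give the same set.
    """
    pair_ids = np.asarray(pair_ids)
    scores = np.asarray(scores, dtype=float)
    # Nine interior edges split the score range into ten deciles (0..9).
    edges = np.quantile(scores, np.linspace(0.1, 0.9, 9))
    deciles = np.searchsorted(edges, scores, side="right")
    keys = np.array([hash_unit(int(p), seed) for p in pair_ids])
    chosen = []
    for d in range(10):
        idx = np.flatnonzero(deciles == d)
        order = idx[np.argsort(keys[idx], kind="stable")]  # smallest hash first
        chosen.extend(pair_ids[order[:per_decile]].tolist())
    return sorted(chosen)


if __name__ == "__main__":
    ids = np.arange(1000)
    scores = (ids * 37 % 1000) / 1000.0   # toy scores, all distinct
    print(decile_stratified_sample(ids, scores, per_decile=5)[:10])
```

Because the selection key is a hash of the pair id rather than a draw from a random generator, changing `seed` yields an independent but equally reproducible subsample, which is one plausible way to obtain several independent calibration subsamples.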