The broken sample problem revisited: Proof of a conjecture by Bai-Hsing and high-dimensional extensions
Simiao Jiao, Yihong Wu, Jiaming Xu
TL;DR
This work resolves the sharp detection threshold for the broken sample problem when the two sample sizes grow proportionally and may be unequal, by linking strong detection to two information measures: the $\chi^2$-information $I_{\chi^2}(X;Y)$ and the Hirschfeld-Gebelein-Rényi maximal correlation $\rho(X;Y)$. It develops two computationally efficient tests, one based on the spectrum of the likelihood ratio operator and one on histogram embeddings, and shows that they achieve the threshold in broad settings, including fixed and high-dimensional regimes. The authors characterize the impossibility regime via a second-moment analysis and derive limiting distributions for key test statistics, connecting to Bai-Hsing's conjecture and extending to Gaussian and Bernoulli models with precise thresholds. Beyond exact thresholds, the paper offers practical tests with linear or near-linear runtimes and power close to the optimum, and discusses Wasserstein-based methods as universal alternatives. Together, these results advance the theory of permutation-invariant detection in de-anonymization, data integration, and related shuffled-record problems, and they pave the way for handling unknown joint distributions and highly imbalanced data in future work.
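For reference, the two information measures named above are usually defined as follows. These are the standard textbook definitions, stated here as an assumption about the intended meaning; the paper's own notation or normalization may differ.

$$
I_{\chi^2}(X;Y) = \chi^2\!\left(P_{XY}\,\|\,P_X\otimes P_Y\right)
= \mathbb{E}_{P_X\otimes P_Y}\!\left[\left(\frac{dP_{XY}}{d(P_X\otimes P_Y)}\right)^{2}\right]-1,
$$

$$
\rho(X;Y) = \sup\left\{\mathbb{E}[f(X)g(Y)] \;:\; \mathbb{E}f(X)=\mathbb{E}g(Y)=0,\ \mathbb{E}f(X)^2=\mathbb{E}g(Y)^2=1\right\}.
$$

As a simple illustration, for a bivariate Gaussian pair with correlation coefficient $r$, the maximal correlation is $\rho(X;Y)=|r|$ and the $\chi^2$-information is $I_{\chi^2}(X;Y)=r^2/(1-r^2)$.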
Abstract
We revisit the classical broken sample problem: two samples of i.i.d. data points $\mathbf{X}=\{X_1,\ldots,X_n\}$ and $\mathbf{Y}=\{Y_1,\ldots,Y_m\}$, with $m\leq n$, are observed without correspondence. Under the null hypothesis, $\mathbf{X}$ and $\mathbf{Y}$ are independent. Under the alternative hypothesis, $\mathbf{Y}$ is correlated with a random subsample of $\mathbf{X}$, in the sense that the pairs $(X_{\pi(i)},Y_i)$, $i\in[m]$, are drawn independently from some bivariate distribution for some latent injection $\pi:[m]\to[n]$. Originally introduced by DeGroot, Feder, and Goel (1971) to model matching records in census data, this problem has recently gained renewed interest due to its applications in data de-anonymization, data integration, and target tracking. Despite extensive research over the past decades, determining the precise detection threshold has remained an open problem even for equal sample sizes ($m=n$). Assuming $m$ and $n$ grow proportionally, we show that the sharp threshold is given by a spectral and an $L_2$ condition of the likelihood ratio operator, resolving a conjecture of Bai and Hsing (2005) in the positive. These results are extended to high dimensions and settle the sharp detection thresholds for Gaussian and Bernoulli models.
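To make the two hypotheses concrete, the following is a minimal simulation sketch of the data-generating process in a scalar Gaussian model. It is an illustration only and is not taken from the paper: the function name `sample_broken`, the parameter `rho` (the correlation of each matched pair), and the choice of standard normal marginals are all assumptions made here, and no detection test is implemented.

```python
import numpy as np


def sample_broken(n, m, rho, rng, correlated=True):
    """Draw (X, Y) for a scalar Gaussian broken-sample model (illustrative).

    Hypothetical helper, not the paper's code. Under the null
    (correlated=False) X and Y are independent standard normal samples.
    Under the alternative, Y_i has correlation `rho` with X_{pi(i)} for a
    uniformly random injection pi: [m] -> [n], which the observer never sees.
    """
    if not 0 < m <= n:
        raise ValueError("require 0 < m <= n")
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")

    x = rng.standard_normal(n)
    if not correlated:
        # Null hypothesis: Y is drawn independently of X.
        return x, rng.standard_normal(m), None

    # Latent injection: m distinct indices of X in uniformly random order.
    pi = rng.choice(n, size=m, replace=False)
    noise = rng.standard_normal(m)
    # Each pair (X_{pi(i)}, Y_i) is bivariate normal with correlation rho,
    # and Y_i keeps a standard normal marginal.
    y = rho * x[pi] + np.sqrt(1.0 - rho**2) * noise
    return x, y, pi


# Example usage: one draw from each hypothesis with m < n.
rng = np.random.default_rng(0)
x_alt, y_alt, pi = sample_broken(n=2000, m=1000, rho=0.8, rng=rng)
x_null, y_null, _ = sample_broken(n=2000, m=1000, rho=0.8, rng=rng, correlated=False)
```

The injection `pi` is returned only so that the simulation can be checked; a detection procedure would receive `x` and `y` alone, and since both marginals are standard normal under either hypothesis, no test based on one sample by itself can distinguish the two cases.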
