The broken sample problem revisited: Proof of a conjecture by Bai-Hsing and high-dimensional extensions

Simiao Jiao; Yihong Wu; Jiaming Xu

TL;DR

This work resolves the sharp detection threshold for the broken sample problem under proportional sampling and unequal sizes by linking strong detection to two information measures: the $\chi^2$-information $I_{\chi^2}(X;Y)$ and the Hirschfeld-Gebelein-Rényi maximal correlation $\rho(X;Y)$. It develops two computationally efficient tests based on the spectrum of the likelihood ratio operator and histogram embeddings, establishing sufficiency and providing rate-optimal performance in broad settings, including fixed and high-dimensional regimes. The authors fully characterize impossibility regimes via a second-moment analysis and derive limiting distributions for key test statistics, connecting to Bai-Hsing's conjecture and extending to Gaussian and Bernoulli models with precise thresholds. Beyond exact thresholds, the paper offers practical testing tools with linear or near-linear runtimes and demonstrates power close to the optimum, with Wasserstein-based methods discussed as universal alternatives. Together, these results advance the theory of permutation-invariant detection in de-anonymization, data integration, and related shuffled-record problems, and they pave the way for handling unknown joint distributions and highly imbalanced data in future work.
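For reference, the two information measures named above have standard definitions, sketched here. The notation ($P_{XY}$ for the joint law, $P_X\otimes P_Y$ for the product of the marginals, and the test functions $f$, $g$) is assumed for illustration and may differ from the paper's.

$$
I_{\chi^2}(X;Y) \;=\; \chi^2\!\left(P_{XY}\,\|\,P_X\otimes P_Y\right) \;=\; \int \left(\frac{dP_{XY}}{d(P_X\otimes P_Y)}\right)^{2} d(P_X\otimes P_Y) \;-\; 1,
$$

$$
\rho(X;Y) \;=\; \sup\Big\{\mathbb{E}[f(X)g(Y)] \;:\; \mathbb{E}f(X)=\mathbb{E}g(Y)=0,\ \mathbb{E}f(X)^2=\mathbb{E}g(Y)^2=1\Big\}.
$$

As a sanity check, for a bivariate Gaussian pair with correlation coefficient $r$, these evaluate to $\rho(X;Y)=|r|$ and $I_{\chi^2}(X;Y)=r^2/(1-r^2)$.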

Abstract

We revisit the classical broken sample problem: Two samples of i.i.d. data points $\mathbf{X}=\{X_1,\ldots,X_n\}$ and $\mathbf{Y}=\{Y_1,\ldots,Y_m\}$ are observed without correspondence with $m\leq n$. Under the null hypothesis, $\mathbf{X}$ and $\mathbf{Y}$ are independent. Under the alternative hypothesis, $\mathbf{Y}$ is correlated with a random subsample of $\mathbf{X}$, in the sense that $(X_{\pi(i)},Y_i)$'s are drawn independently from some bivariate distribution for some latent injection $\pi:[m] \to [n]$. Originally introduced by DeGroot, Feder, and Goel (1971) to model matching records in census data, this problem has recently gained renewed interest due to its applications in data de-anonymization, data integration, and target tracking. Despite extensive research over the past decades, determining the precise detection threshold has remained an open problem even for equal sample sizes ($m=n$). Assuming $m$ and $n$ grow proportionally, we show that the sharp threshold is given by a spectral and an $L_2$ condition of the likelihood ratio operator, resolving a conjecture of Bai and Hsing (2005) in the positive. These results are extended to high dimensions and settle the sharp detection thresholds for Gaussian and Bernoulli models.
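To make the two hypotheses concrete, here is a minimal simulation sketch of the data-generating model, assuming a scalar Gaussian instance in which each matched pair is standard bivariate normal with correlation `rho`. The function name `sample_broken` and its parameters are hypothetical and chosen only for illustration; this generates data and is not one of the paper's tests.

```python
import numpy as np

def sample_broken(n, m, rho, correlated, rng):
    """Draw (X, Y) from the broken sample model (Gaussian case, assumed).

    Returns x (length n), y (length m), and the latent injection pi
    (None under the null). An observer sees only x and y.
    """
    if m > n:
        raise ValueError("the model requires m <= n")
    x = rng.standard_normal(n)
    if not correlated:
        # Null hypothesis: Y is independent of X.
        return x, rng.standard_normal(m), None
    # Alternative: a uniformly random injection pi: [m] -> [n]
    # (m distinct indices of X, in random order).
    pi = rng.choice(n, size=m, replace=False)
    noise = rng.standard_normal(m)
    # (X_{pi(i)}, Y_i) is standard bivariate normal with correlation rho.
    y = rho * x[pi] + np.sqrt(1.0 - rho**2) * noise
    return x, y, pi

rng = np.random.default_rng(0)
x, y, pi = sample_broken(n=1000, m=500, rho=0.9, correlated=True, rng=rng)
# With pi revealed the dependence is obvious; the detection problem is
# to decide between the hypotheses from x and y alone.
print(np.corrcoef(x[pi], y)[0, 1])
```

Both marginals are standard normal under either hypothesis, so any successful test has to exploit the joint structure of the two unordered samples rather than their individual distributions.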
