
The broken sample problem revisited: Proof of a conjecture by Bai-Hsing and high-dimensional extensions

Simiao Jiao, Yihong Wu, Jiaming Xu

TL;DR

This work resolves the sharp detection threshold for the broken sample problem under proportional sampling and unequal sizes by linking strong detection to two information measures: the $\chi^2$-information $I_{\chi^2}(X;Y)$ and the Hirschfeld-Gebelein-Rényi maximal correlation $\rho(X;Y)$. It develops two computationally efficient tests based on the spectrum of the likelihood ratio operator and histogram embeddings, establishing sufficiency and providing rate-optimal performance in broad settings, including fixed and high-dimensional regimes. The authors fully characterize impossibility regimes via a second-moment analysis and derive limiting distributions for key test statistics, connecting to Bai-Hsing's conjecture and extending to Gaussian and Bernoulli models with precise thresholds. Beyond exact thresholds, the paper offers practical testing tools with linear or near-linear runtimes and demonstrates power close to the optimum, with Wasserstein-based methods discussed as universal alternatives. Together, these results advance the theory of permutation-invariant detection in de-anonymization, data integration, and related shuffled-record problems, and they pave the way for handling unknown joint distributions and highly imbalanced data in future work.

Abstract

We revisit the classical broken sample problem: Two samples of i.i.d. data points $\mathbf{X}=\{X_1,\cdots, X_n\}$ and $\mathbf{Y}=\{Y_1,\cdots,Y_m\}$ are observed without correspondence with $m\leq n$. Under the null hypothesis, $\mathbf{X}$ and $\mathbf{Y}$ are independent. Under the alternative hypothesis, $\mathbf{Y}$ is correlated with a random subsample of $\mathbf{X}$, in the sense that the pairs $(X_{\pi(i)},Y_i)$ are drawn independently from some bivariate distribution for some latent injection $\pi:[m] \to [n]$. Originally introduced by DeGroot, Feder, and Goel (1971) to model matching records in census data, this problem has recently gained renewed interest due to its applications in data de-anonymization, data integration, and target tracking. Despite extensive research over the past decades, determining the precise detection threshold has remained an open problem even for equal sample sizes ($m=n$). Assuming $m$ and $n$ grow proportionally, we show that the sharp threshold is given by a spectral and an $L_2$ condition of the likelihood ratio operator, resolving a conjecture of Bai and Hsing (2005) in the positive. These results are extended to high dimensions and settle the sharp detection thresholds for Gaussian and Bernoulli models.
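
To make the two hypotheses concrete, the following is a minimal simulation sketch, not the authors' code. It assumes a one-dimensional Gaussian instance in which both marginals are standard normal and each linked pair has correlation `rho`; the function name `sample_broken_gaussian` and its parameters are hypothetical, chosen only for illustration.

```python
import random


def sample_broken_gaussian(n, m, rho, alternative, seed=0):
    """Draw (X, Y) under the null or the alternative of the broken sample problem.

    Assumption (not from the source): standard normal marginals, and under the
    alternative each pair (X_{pi(i)}, Y_i) is bivariate Gaussian with correlation rho.
    """
    if not 0 < m <= n:
        raise ValueError("need 0 < m <= n")
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    if alternative:
        # Latent injection pi: [m] -> [n], drawn uniformly at random.
        pi = rng.sample(range(n), m)
        noise_scale = (1.0 - rho * rho) ** 0.5
        # Y_i = rho * X_{pi(i)} + sqrt(1 - rho^2) * Z_i keeps Y_i standard normal.
        ys = [rho * xs[j] + noise_scale * rng.gauss(0.0, 1.0) for j in pi]
    else:
        # Null hypothesis: Y is independent of X.
        ys = [rng.gauss(0.0, 1.0) for _ in range(m)]
    # pi is deliberately not returned: the observer sees the samples without correspondence.
    return xs, ys
```

Calling `sample_broken_gaussian(8, 6, 0.9, alternative=True)` returns eight X values and six Y values. Because the injection is uniformly random and hidden, the order of the Y values carries no information about which X values they were linked to.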

Paper Structure

This paper contains 23 sections, 12 theorems, 86 equations, 3 figures, 2 tables.

Key Result

Theorem 1

For the likelihood ratio ${\mathbf{L}}({\mathbf{X}},{\mathbf{Y}})$ corresponding to problem (eq:problem formulation), where $t_\ell = \binom{n-\ell-1}{m-\ell}/\binom{n}{m}$ satisfies $\sum_{\ell=0}^m t_\ell=1$ and $a_\ell$ is the $\ell$-th coefficient in the power series expansion in $z$ of $\prod_{k=0}^\infty 1/(1-z \lambda^2_k)$.
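
The two quantities named in Theorem 1 can be computed directly. The sketch below is illustrative only: the function names are hypothetical, the infinite product is truncated to a finite list of $\lambda_k$ values (an assumption made so that the computation terminates), and the case $m=n$ is excluded because it requires a convention for $\binom{-1}{0}$ that `math.comb` does not support.

```python
from fractions import Fraction
from math import comb


def subsample_weights(n, m):
    """Return [t_0, ..., t_m] with t_l = C(n-l-1, m-l) / C(n, m), as exact fractions.

    Restricted to m < n so that every binomial has a non-negative upper index.
    """
    if not 0 <= m < n:
        raise ValueError("this sketch requires 0 <= m < n")
    total = comb(n, m)
    return [Fraction(comb(n - l - 1, m - l), total) for l in range(m + 1)]


def series_coefficients(lambdas, order):
    """Return [a_0, ..., a_order], the power series coefficients in z of
    prod_k 1 / (1 - z * lambda_k^2), for a finite list of lambda_k values.
    """
    coeffs = [1] + [0] * order
    for lam in lambdas:
        c = lam * lam
        # Multiplying a series by 1/(1 - c z) gives new[j] = old[j] + c * new[j-1];
        # updating in increasing j does this in place.
        for j in range(1, order + 1):
            coeffs[j] += c * coeffs[j - 1]
    return coeffs
```

For instance, `sum(subsample_weights(8, 6))` equals 1 exactly, consistent with the normalization stated in the theorem, and `series_coefficients([1, 1], 4)` gives the coefficients of $1/(1-z)^2$.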

Figures (3)

  • Figure 1: ROC curves of various tests for the broken bivariate Gaussian samples $(d=1)$.
  • Figure 2: Power curves of various tests for the broken bivariate Gaussian samples $(d=1)$, with Type-I error fixed at 0.05. Left: equal sample sizes $m=n$. Right: unequal sample sizes $m=n/2$.
  • Figure 3: An instance of the bipartite graph $G_\pi$ for $n=8$ and $m=6$. The edges $(i,i)$ and $(i,\pi(i))$ are in blue (dashed) and red (solid), respectively. Here, the 2-core is a 6-cycle with $I=\{1,2,3\}$. After removing this 2-core, the remaining graph consists of two disjoint paths of lengths 2 and 4.

Theorems & Definitions (32)

  • Definition 1: Broken sample detection
  • Example 1: Gaussian model
  • Example 2: Bernoulli model
  • Theorem 1
  • Corollary 1
  • Remark 1
  • Remark 2
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • ...and 22 more