Geometric planted matchings beyond the Gaussian model

Lucas da Rocha Schwengber, Roberto Imbuzeiro Oliveira

TL;DR

The paper studies the problem of recovering a planted permutation between two snapshots of $n$ points in $\mathbb{R}^d$ under random perturbations, a model applicable to particle tracking and entity resolution. It derives minimax lower bounds via matchings in random geometric graphs and analyzes the Least Sum of Squares (LSS) estimator, showing that the order of its number of mistakes is minimax optimal when $d$ is fixed and optimal up to $n^{o(1)}$ factors when $d = o(\log n)$. In the high-dimensional regime, where initial positions and perturbations have independent sub-Gaussian coordinates, it gives sufficient conditions under which LSS makes no mistakes with high probability, and proves an analogous result for LSS-C, a variant that incorporates the covariance matrix of the perturbations. Overall, the work quantifies how geometry, dimension, and noise interact to govern recoverability in data association tasks.

Abstract

We consider the problem of recovering an unknown matching between a set of $n$ randomly placed points in $\mathbb{R}^d$ and random perturbations of these points. This can be seen as a model for particle tracking and, more generally, entity resolution. We use matchings in random geometric graphs to derive minimax lower bounds for this problem that hold under great generality. Using these results, we show that for a broad class of distributions, the order of the number of mistakes made by an estimator that minimizes the sum of squared Euclidean distances is minimax optimal when $d$ is fixed and is optimal up to $n^{o(1)}$ factors when $d = o(\log n)$. In the high-dimensional regime we consider a setup where both initial positions and perturbations have independent sub-Gaussian coordinates. In this setup we give sufficient conditions under which the same estimator makes no mistakes with high probability. We prove an analogous result for an adapted version of this estimator that incorporates information on the covariance matrix of the perturbations.
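
The estimator described in the abstract, which minimizes the sum of squared Euclidean distances over all matchings, can be illustrated on a tiny instance. The sketch below is not the authors' code: the function names, the indexing convention $Y_{\pi^\star(i)} = X_i + Z_i$, and the Gaussian choices for $X_i$ and $Z_i$ are assumptions made for illustration, and the brute-force search over all $n!$ permutations is only feasible for very small $n$.

```python
import itertools
import random


def lss_estimator(xs, ys):
    """Return the permutation pi minimizing sum_i ||ys[pi[i]] - xs[i]||^2.

    Brute force over all n! permutations, so only usable for tiny n.
    """
    n = len(xs)

    def cost(pi):
        return sum(
            sum((a - b) ** 2 for a, b in zip(xs[i], ys[pi[i]]))
            for i in range(n)
        )

    return min(itertools.permutations(range(n)), key=cost)


def planted_instance(n, d, sigma, rng):
    """Draw X_i ~ N(0, I_d), a uniform planted permutation, and noisy copies."""
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    pi_star = list(range(n))
    rng.shuffle(pi_star)
    ys = [None] * n
    for i in range(n):
        # Assumed convention: Y_{pi*(i)} = X_i + Z_i with Z_i ~ N(0, sigma^2 I_d).
        ys[pi_star[i]] = [x + rng.gauss(0.0, sigma) for x in xs[i]]
    return xs, ys, tuple(pi_star)


def hamming(pi, pi_star):
    """Number of indices on which two permutations disagree."""
    return sum(a != b for a, b in zip(pi, pi_star))


rng = random.Random(0)
xs, ys, pi_star = planted_instance(n=6, d=2, sigma=0.05, rng=rng)
pi_hat = lss_estimator(xs, ys)
print("mistakes:", hamming(pi_hat, pi_star))
```

Because the objective is a sum of pairwise costs, the same minimizer can be found in polynomial time by a linear assignment solver (for example `scipy.optimize.linear_sum_assignment` on the matrix of squared distances), which is what one would use for realistic $n$.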

Paper Structure

This paper contains 24 sections, 23 theorems, 195 equations, and 2 figures.

Key Result

Theorem 1.1

Under Model \ref{mod:low_dim}, for all $n \geq 3$, […] where the infimum is taken over all estimators $\hat{\pi}$.

Figures (2)

  • Figure 1: A small window of a simulation of the performance of $\hat{\pi}_{\mathrm{LSS}}$ on a sample following \ref{eq:final_pos} of total size $n=3000$ with $X_1 \sim \mathcal{N}(0,I_d)$, $Z_1 \sim \mathcal{N}(0,\sigma^2 I_d)$ for two different values of $\sigma^2$. Arrows indicate the initial position $X_i$ to which a given $Y_j$ was matched by $\hat{\pi}_{\mathrm{LSS}}$. Red arrows indicate that the estimated match for the given point was incorrect, while grey arrows indicate that it was correct. Blue lines between pairs of initial positions indicate that they are within distance $r = \sqrt{2}\sigma$ of each other. Note that most incorrect matches involve initial positions that share an edge in $G(n,r,d,\|\cdot\|_2)$. Our strategy for proving the lower bounds takes inspiration from this observation. The code to generate this image can be found at https://github.com/Lucas-Schwengber/particle_tracking.
  • Figure 2: Simulated experiments to estimate the error rate $\frac{\mathbb{E}\left[ d_H\left(\hat{\pi}_{\mathrm{LSS}},\pi^{\star}\right) \right]}{n}$ as a function of $\sigma^2$ for different choices of $\mathcal{Q}$. We consider \ref{eq:final_pos} with $X_1 \sim \mathcal{N}(0,I_d)$ and $Z_1 = \sigma \tilde{Z}_1$ with four different choices for the distribution of $\tilde{Z}_1$: $\tilde{Z}_1 \sim \mathcal{N}(0,I_d)$ (Gaussian), $\tilde{Z}_1 \sim \text{Unif}(\mathbb{S}^{d-1})$ (Spherical), $\tilde{Z}_1 \sim \text{Unif}([-\sqrt{3},\sqrt{3}]^d)$ (Uniform) and $\tilde{Z}_1 \sim \text{Unif}(\{-1,1\}^d)$ (Rademacher). We also plot $\tau n^2 \sigma^d$ (Gaussian prediction) for the value of $\tau$ from \ref{thm:lss_up_fix_d} obtained for the Gaussian model. Although some of these distributions do not directly satisfy the assumptions of \ref{thm:lss_up_fix_d}, we will see in \ref{sec:up_bound_ld} that the same result holds under weaker assumptions that include all examples shown in the figure. For each choice of $\mathcal{Q}$ in the simulations, we considered $n=100$, $d=2,3$ and took the average of $\frac{d_H\left(\hat{\pi}_{\mathrm{LSS}},\pi^{\star}\right)}{n}$ over $10000$ independent trials. Both axes use a log scale. The code to reproduce the figures can also be found at https://github.com/Lucas-Schwengber/particle_tracking.
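
The observation in the Figure 1 caption, that mistakes concentrate on initial positions joined by an edge of the geometric graph at radius $r = \sqrt{2}\sigma$, can be explored with a small sketch. This is not taken from the authors' repository: the function names are hypothetical, the edge rule "distance at most $r$" is an assumption about the graph's definition, and the greedy procedure only produces a maximal matching, whose size is a lower bound on (and at least half of) the largest matching size.

```python
import math
import random


def geometric_graph_edges(points, r):
    """Edges (i, j), i < j, whose endpoints are at Euclidean distance <= r."""
    edges = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= r:
                edges.append((i, j))
    return edges


def greedy_matching(edges):
    """Build a maximal (not necessarily maximum) matching greedily."""
    used, matching = set(), []
    for i, j in edges:
        if i not in used and j not in used:
            matching.append((i, j))
            used.update((i, j))
    return matching


# Illustrative parameters (assumed, not those of the paper's figures).
rng = random.Random(0)
n, d, sigma = 300, 2, 0.05
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

edges = geometric_graph_edges(points, r=math.sqrt(2) * sigma)
matching = greedy_matching(edges)
print("edges:", len(edges), "greedy matching size:", len(matching))
```

Varying `sigma` in this sketch shows how the number of close pairs, and hence the number of positions that are hard to tell apart, grows with the noise level.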

Theorems & Definitions (50)

  • Theorem 1.1: Minimax lower bound (proof in § \ref{sub:minimaxlowerbound})
  • Lemma 1.2: Largest matching size lower bound (proof in § \ref{sec:largest_matching_rgg})
  • Remark 1.3
  • Remark 1.4
  • Remark 1.5
  • Remark 1.6
  • Theorem 1.7: LSS error upper bound (proof in § \ref{sec:up_bound_ld})
  • Theorem 1.8: Perfect recovery for LSS in high dimensions (proof in § \ref{sec:up_bound_hd})
  • Theorem 1.9: Perfect recovery for LSS-C in high dimensions (proof in § \ref{sec:up_bound_hd})
  • Proposition 2.1: Low-dimension, Gaussian initial positions, Gaussian noise
  • ...and 40 more