
The Most Dispersed Subset of Random Points in $\mathbb{R}^d$

Fabio Deelan Cunden, Noemi Cuppone, Giovanni Gramegna, Pierpaolo Vivo

TL;DR

The paper addresses the problem of selecting a subset of $M$ points from $N$ random points in $\mathbb{R}^d$ so as to maximize pairwise dispersion. It develops two complementary analytic approaches, a mean-field treatment based on order statistics and a replica-method calculation, to derive the full large-$N$ statistics of the maximal $M$-dispersion, including large-deviation tails via the scaled cumulant generating function (SCGF) and rate functions. A central finding is that, for any $d\ge 1$ and any rotationally symmetric distribution $F$, the optimal subset at large $N$ consists of the points lying outside a $d$-ball whose radius is determined self-consistently, with explicit relations for the radius. In $d=1$ the finite-$N,M$ case admits an exact prefix-suffix structure (the $k$ leftmost and $M-k$ rightmost points), which is accurately approximated by balanced configurations. The results are validated against numerical simulations and greedy heuristics, and the two methods yield the same SCGF in rotationally symmetric settings. These results give an analytic picture of diversity optimization in any dimension $d$ and offer benchmarks for dispersion-based heuristics.

Abstract

Consider a population of $N$ individuals, each having $d\geq 1$ different traits, and an additive measure, called dispersion, which rewards large pairwise separations between traits. The goal is to select $M\leq N$ individuals such that their traits are as dispersed as possible. We compute analytically the full statistics (including large deviation tails) of the maximally achievable dispersion among sub-populations of size $M$ when the traits are independent and identically distributed. Two complementary approaches are developed, one based on a mean-field theory for order statistics, and the other on the replica method from the field of disordered systems. In all dimensions $d$, and for rotationally symmetric distributions, the optimal subset for large populations consists of all points lying outside a $d$-dimensional ball whose radius is determined self-consistently. For a single trait ($d=1$), the statistics of the maximal dispersion can be tackled for finite $N,M$ as well. The formulae we obtained are corroborated by numerical simulations on small instances and by heuristic algorithms that find near-optimal solutions.
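To make the selection problem concrete, the following minimal sketch finds the most dispersed subset of a small $d=1$ sample by exhaustive search and checks whether it has the prefix-suffix form. It is an illustration only: the summary above does not give the exact definition of dispersion, so the code assumes the sum of squared pairwise differences, and the function names `dispersion`, `best_subset` and `is_prefix_suffix` are hypothetical.

```python
import itertools
import random


def dispersion(points):
    """Sum of squared pairwise differences (ASSUMED form of the dispersion)."""
    return sum((a - b) ** 2 for a, b in itertools.combinations(points, 2))


def best_subset(xs, m):
    """Exhaustive search: indices of the size-m subset with the largest dispersion."""
    return max(
        itertools.combinations(range(len(xs)), m),
        key=lambda idx: dispersion([xs[i] for i in idx]),
    )


def is_prefix_suffix(idx, n):
    """True if idx is the k leftmost plus the m-k rightmost indices for some k."""
    chosen = set(idx)
    m = len(chosen)
    k = 0
    while k < m and k in chosen:  # length of the leftmost run
        k += 1
    return chosen == set(range(k)) | set(range(n - (m - k), n))


# Illustrative values matching Figure 1(a): N = 10 sorted uniform traits, M = 4.
rng = random.Random(0)
xs = sorted(rng.random() for _ in range(10))
idx = best_subset(xs, 4)
print("optimal indices:", idx, "prefix-suffix:", is_prefix_suffix(idx, len(xs)))
```

Exhaustive search costs $\binom{N}{M}$ evaluations, so it is only practical for small instances of the kind used to check the formulae.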

Paper Structure (19 sections, 126 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: (a) $N=10$ random traits ('$\times$') on a line ($d=1$). In red, the subset that optimises the $M$-dispersion with $M=4$. The blue boxes indicate the subset that optimises the $M$-dispersion for $M=6$. The optimising subsets in $d=1$ are composed of two separate clusters for every $M$, comprising the $k$ leftmost and the $M-k$ rightmost variables for some $k$. The optimal subset for $M+1$ includes the optimal subset for $M$, for any $M$. (b) $N=5$ random traits ('$\times$') on the plane ($d=2$). In red, the subset that optimises the $M$-dispersion with $M=2$. The blue boxes indicate the subset that optimises the $M$-dispersion for $M=3$. In dimension $d>1$, the optimal subset for $M+1$ does not necessarily include the optimal subset for $M$, as a global rearrangement could be more favorable in terms of increased dispersion.
  • Figure 2: Schematic representation of the various terms of the expression in Eq. \ref{eqmaintext:expgbalanced}.
  • Figure 3: Mean and variance of the maximal $M$-dispersion of uniform random variables obtained numerically from the true (not necessarily balanced) prefix-suffix optimiser, compared with the theoretical predictions for the balanced configuration in the large-$N$ limit provided by equations \ref{largeNasymptKappa1Uniform} and \ref{eq:var_divmax_unif} (red line). The dotted line shows the predictions for the balanced configuration for the smallest value $N=10$, provided by equations \ref{eq:DMax_mean_unif_finiteN} and \ref{eq:DMax_variance_unif_finiteN} with $M=\lfloor \alpha N\rfloor$, which confirms that the balanced configuration is close to the optimal one already at such a low value of $N$. The sample size for the estimation of mean and variance used in the numerics here is $10^4$.
  • Figure 4: Plots for $N=500$ points $(x,y)$ sampled according to the standard Gaussian distribution in $d=2$. The yellow points are producing the maximal $M$-dispersion for different values of $\alpha=M/N$ according to the greedy algorithm described in Sec. \ref{sec:heuristics}. The black dashed circle centered in the origin has radius $R(\alpha)$ given by Eq. \ref{eq:RadiusFabio}.
  • Figure 5: Mean and variance of $D_{\mathrm{max}}^M$ (with appropriate normalisation in $N$) for Gaussian points in $d=2$. The results obtained from the greedy algorithm with several values of $N$ are compared with the theoretical values (red line). The numerical estimates have been obtained from a sample of size $10^4$.
  • ...and 2 more figures
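
The setting of Figure 4 can be explored with a simple greedy heuristic. The sketch below is not the authors' algorithm (its details are not given in this summary); it is a generic backward-elimination rule, written under the assumption that dispersion is the sum of squared pairwise Euclidean distances. The comparison radius is also an assumption: it is obtained by requiring that a fraction $\alpha$ of a standard Gaussian sample in $d=2$ falls outside the circle, $e^{-R^2/2}=\alpha$, and is not copied from the paper's equation for $R(\alpha)$. The name `greedy_removal` is hypothetical.

```python
import math
import random


def greedy_removal(points, m):
    """Backward elimination (generic heuristic, NOT necessarily the paper's):
    repeatedly drop the point whose summed squared distance to the points
    still kept is smallest, until m points remain."""
    n = len(points)
    # Pairwise squared Euclidean distances in d = 2.
    sq = [[(px - qx) ** 2 + (py - qy) ** 2 for (qx, qy) in points]
          for (px, py) in points]
    contrib = [sum(row) for row in sq]  # contribution of each point to the kept set
    kept = set(range(n))
    while len(kept) > m:
        worst = min(kept, key=lambda i: contrib[i])
        kept.remove(worst)
        for i in kept:  # update contributions after the removal
            contrib[i] -= sq[i][worst]
    return sorted(kept)


# Illustrative values (smaller than the N = 500 of Figure 4, to keep it fast).
rng = random.Random(1)
n, m = 200, 60
alpha = m / n
pts = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
chosen = greedy_removal(pts, m)

# ASSUMED radius: fraction alpha of a standard 2D Gaussian lies outside R.
radius = math.sqrt(2 * math.log(1 / alpha))
outside = sum(math.hypot(*pts[i]) > radius for i in chosen)
print(f"{outside} of {len(chosen)} selected points lie outside R = {radius:.3f}")
```

At finite $N$ the selected points need not all lie outside this circle, since the outside-a-ball structure is a large-population statement.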

Theorems & Definitions (6)

  • Example 1: Uniform distribution in an interval
  • Example 2
  • Example 3: Uniform distribution in $d=1$ at finite-$N$
  • Example 4: Gaussian distributions in $d>1$
  • Example 5: Uniform distribution in $d=1$
  • Example 6: Gaussian density