The Most Dispersed Subset of Random Points in $\mathbb{R}^d$

Fabio Deelan Cunden; Noemi Cuppone; Giovanni Gramegna; Pierpaolo Vivo

TL;DR

The paper addresses the problem of selecting a subset of $M$ points from $N$ random points in $\mathbb{R}^d$ so as to maximize their pairwise dispersion. It develops two complementary analytic approaches, a mean-field treatment based on order statistics and a replica-method calculation, to derive the full large-$N$ statistics of the maximal $M$-dispersion, including large-deviation tails via the scaled cumulant generating function (SCGF) and the rate function. A central finding is that, for any $d\ge 1$ and rotationally symmetric distributions, the optimal subset for large $N$ consists of the points lying outside a $d$-ball whose radius is determined self-consistently. In $d=1$ the finite-$N,M$ case can be treated as well: the optimal subset has an exact prefix-suffix geometry (the $k$ leftmost and the $M-k$ rightmost points), which is well approximated by a balanced configuration. The results are corroborated by numerical simulations on small instances and by heuristic greedy algorithms, and the two approaches yield the same SCGF in rotationally symmetric settings. They provide analytical benchmarks for dispersion-based heuristics in diversity optimization.

Abstract

Consider a population of $N$ individuals, each having $d\geq 1$ different traits, and an additive measure, called dispersion, which rewards large pairwise separations between traits. The goal is to select $M\leq N$ individuals such that their traits are as dispersed as possible. We compute analytically the full statistics (including large deviation tails) of the maximally achievable dispersion among sub-populations of size $M$ when the traits are independent and identically distributed. Two complementary approaches are developed, one based on a mean-field theory for order statistics, and the other on the replica method from the field of disordered systems. In all dimensions $d$, and for rotationally symmetric distributions, the optimal subset for large populations consists of all points lying outside a $d$-dimensional ball whose radius is determined self-consistently. For a single trait ($d=1$), the statistics of the maximal dispersion can be tackled for finite $N,M$ as well. The formulae we obtained are corroborated by numerical simulations on small instances and by heuristic algorithms that find near-optimal solutions.
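
As a small illustration of the ball picture, the sketch below assumes a standard Gaussian in $d=2$ and reads "self-consistently" in the simplest way: if the optimal subset is all points outside a ball and has size $M=\alpha N$, then the ball must leave a fraction $\alpha$ of the probability mass outside. The function names are hypothetical, and the closed form is a consequence of that mass condition for this specific distribution, not a formula quoted from the paper.

```python
import math
import random


def ball_radius_gaussian_2d(alpha):
    """Radius R such that P(|x| > R) = alpha for a standard Gaussian in d=2.

    Assumption: the radius is fixed by the mass condition alone.
    In d=2 the squared norm is exponentially distributed,
    P(|x|^2 > r^2) = exp(-r^2 / 2), which can be inverted in closed form.
    """
    return math.sqrt(-2.0 * math.log(alpha))


def sample_gaussian_2d(n, seed=0):
    """Draw n i.i.d. standard Gaussian points in the plane."""
    rng = random.Random(seed)
    return [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]


def fraction_outside(points, radius):
    """Empirical fraction of points lying strictly outside the ball."""
    outside = sum(1 for (x, y) in points if math.hypot(x, y) > radius)
    return outside / len(points)


if __name__ == "__main__":
    alpha = 0.3
    radius = ball_radius_gaussian_2d(alpha)
    points = sample_gaussian_2d(20000, seed=1)
    print("R(alpha) =", radius)
    print("empirical fraction outside:", fraction_outside(points, radius))
```

For large $N$ the empirical fraction outside the ball should be close to $\alpha$, so the points outside it form a subset of size roughly $M$.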

Paper Structure (19 sections, 126 equations, 7 figures, 1 table)

Introduction
Outline of the paper
Problem setting
$M$-Dispersion function
Statistics of maximal $M$-Dispersion and Scaled Cumulant Generating Function
Single trait ($d=1$)
Geometry of the optimal subset
Mean-field approach for large-$N,M$
Back to finite $N,M$
Any number of traits ($d \geq 1$) -- Mean field approach
Typical maximal $M$-dispersion
SCGF and Rate Function of the maximal $M$-dispersion
Any number of traits ($d\geq 1$) -- Replica approach
Averaging the replicated partition function over the disorder
SCGF of the maximal $M$-dispersion
...and 4 more sections

Figures (7)

Figure 1: (a) $N=10$ random traits ('$\times$') on a line ($d=1$). In red, the subset that optimises the $M$-dispersion with $M=4$. The blue boxes indicate the subset that optimises the $M$-dispersion for $M=6$. For every $M$, the optimising subset in $d=1$ is composed of two separate clusters, comprising the $k$ leftmost and the $M-k$ rightmost variables for some $k$. The optimal subset for $M+1$ includes the optimal subset for $M$, for any $M$. (b) $N=5$ random traits ('$\times$') on the plane ($d=2$). In red, the subset that optimises the $M$-dispersion with $M=2$. The blue boxes indicate the subset that optimises the $M$-dispersion for $M=3$. In dimension $d>1$, the optimal subset for $M+1$ does not necessarily include the optimal subset for $M$, as a global rearrangement could be more favorable in terms of increased dispersion.
Figure 2: Schematic representation of the various terms of the expression in Eq. \ref{eqmaintext:expgbalanced}.
Figure 3: Mean and variance of the maximal $M$-dispersion of uniform random variables obtained numerically from the true (not necessarily balanced) prefix-suffix optimiser, compared with the theoretical predictions for the balanced configuration in the large-$N$ limit provided by equations \ref{largeNasymptKappa1Uniform} and \ref{eq:var_divmax_unif} (red line). The dotted line shows the predictions for the balanced configuration for the smallest value $N=10$, provided by equations \ref{eq:DMax_mean_unif_finiteN} and \ref{eq:DMax_variance_unif_finiteN} with $M=\lfloor \alpha N\rfloor$, which confirms that the balanced configuration is close to the optimal one already at such a low value of $N$. The sample size used to estimate the mean and variance in the numerics is $10^4$.
Figure 4: Plots for $N=500$ points $(x,y)$ sampled according to the standard Gaussian distribution in $d=2$. The yellow points produce the maximal $M$-dispersion for different values of $\alpha=M/N$ according to the greedy algorithm described in Sec. \ref{sec:heuristics}. The black dashed circle centered at the origin has radius $R(\alpha)$ given by Eq. \ref{eq:RadiusFabio}.
Figure 5: Mean and variance of $D_{\mathrm{max}}^M$ (with appropriate normalisation in $N$) for Gaussian points in $d=2$. The results obtained from the greedy algorithm with several values of $N$ are compared with the theoretical values (red line). The numerical estimates have been obtained from a sample of size $10^4$.
...and 2 more figures
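
The prefix-suffix geometry described in the caption of Figure 1(a) can be checked on small instances. The sketch below compares an exhaustive search over all subsets of size $M$ with a scan over the $M+1$ prefix-suffix candidates. It assumes the dispersion is the sum of squared pairwise separations; this is an illustrative choice, since the paper's definition is not reproduced in this overview, and the function names are hypothetical.

```python
import itertools
import random


def dispersion(points):
    """Assumed dispersion: sum of squared pairwise separations in d=1."""
    return sum((a - b) ** 2 for a, b in itertools.combinations(points, 2))


def best_by_brute_force(xs, m):
    """Maximal dispersion over all subsets of size m (exponential cost)."""
    return max(dispersion(s) for s in itertools.combinations(xs, m))


def best_prefix_suffix(xs, m):
    """Maximal dispersion over subsets made of the k leftmost and
    the m - k rightmost points, for k = 0, ..., m."""
    xs = sorted(xs)
    n = len(xs)
    best = float("-inf")
    for k in range(m + 1):
        # xs[n - (m - k):] is empty when k == m, so no negative-index pitfall.
        subset = xs[:k] + xs[n - (m - k):]
        best = max(best, dispersion(subset))
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.random() for _ in range(10)]
    for m in (4, 6):
        print(m, best_by_brute_force(xs, m), best_prefix_suffix(xs, m))
```

The two searches agree because, for a dispersion that is convex in each coordinate, moving a selected point to an unselected neighbour on one of its two sides never decreases the dispersion. The scan needs only $M+1$ candidates instead of $\binom{N}{M}$ subsets.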

Theorems & Definitions (6)

Example 1: Uniform distribution in an interval
Example 2
Example 3: Uniform distribution in $d=1$ at finite-$N$
Example 4: Gaussian distributions in $d>1$
Example 5: Uniform distribution in $d=1$
Example 6: Gaussian density
