The Most Dispersed Subset of Random Points in $\mathbb{R}^d$
Fabio Deelan Cunden, Noemi Cuppone, Giovanni Gramegna, Pierpaolo Vivo
TL;DR
The paper addresses the problem of selecting a subset of $M$ points from $N$ random points in $\mathbb{R}^d$ so as to maximize their pairwise dispersion. It develops two complementary analytic approaches, a mean-field treatment based on order statistics and a replica-method calculation, to derive the full large-$N$ statistics of the maximal $M$-dispersion, including large-deviation tails via the scaled cumulant generating function (SCGF) and the associated rate functions. A central finding is that, for any $d\ge 1$ and a rotationally symmetric trait distribution $F$, the optimal subset for large $N$ consists of the points lying outside a $d$-ball whose radius is determined self-consistently. In $d=1$ the finite-$N,M$ case can also be treated: the optimal subset has an exact prefix-suffix geometry, and balanced approximants are accurate. The results are checked against numerical simulations and greedy heuristic algorithms, and the two approaches reinforce each other by yielding the same SCGF in rotationally symmetric settings. These results provide analytical benchmarks for dispersion-based heuristics in diversity optimization.
Abstract
Consider a population of $N$ individuals, each having $d\geq 1$ different traits, and an additive measure, called dispersion, which rewards large pairwise separations between traits. The goal is to select $M\leq N$ individuals such that their traits are as dispersed as possible. We compute analytically the full statistics (including large deviation tails) of the maximally achievable dispersion among sub-populations of size $M$ when the traits are independent and identically distributed. Two complementary approaches are developed, one based on a mean-field theory for order statistics, and the other on the replica method from the field of disordered systems. In all dimensions $d$, and for rotationally symmetric distributions, the optimal subset for large populations consists of all points lying outside a $d$-dimensional ball whose radius is determined self-consistently. For a single trait ($d=1$), the statistics of the maximal dispersion can be tackled for finite $N,M$ as well. The formulae we obtained are corroborated by numerical simulations on small instances and by heuristic algorithms that find near-optimal solutions.
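The selection problem described above can be illustrated on a small instance. The sketch below is not the authors' code: it assumes, purely for illustration, that dispersion is the sum of squared Euclidean distances over all pairs in the subset, and the function names (`dispersion`, `brute_force_max_dispersion`, `prefix_suffix_max_dispersion_1d`, `outside_ball_subset`) are hypothetical. It compares an exhaustive search with the prefix-suffix search in $d=1$ and with the "outside a ball" selection (the $M$ points of largest norm) in $d=2$, which at finite $N$ should be read as a heuristic only.

```python
import itertools
import random


def dispersion(points):
    """Sum of squared Euclidean distances over all unordered pairs.

    ASSUMPTION: this particular additive dispersion is chosen for illustration.
    """
    total = 0.0
    for p, q in itertools.combinations(points, 2):
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total


def brute_force_max_dispersion(points, m):
    """Exhaustive search over all subsets of size m (small N only)."""
    return max(dispersion(s) for s in itertools.combinations(points, m))


def prefix_suffix_max_dispersion_1d(xs, m):
    """d = 1: try the k smallest plus the m - k largest values, for k = 0..m."""
    s = sorted(xs)
    best = float("-inf")
    for k in range(m + 1):
        chosen = s[:k] + s[len(s) - (m - k):]
        best = max(best, dispersion([(x,) for x in chosen]))
    return best


def outside_ball_subset(points, m):
    """Heuristic: keep the m points farthest from the origin."""
    by_norm = sorted(points, key=lambda p: sum(a * a for a in p), reverse=True)
    return by_norm[:m]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    n, m = 12, 5

    # One trait (d = 1): exhaustive optimum versus prefix-suffix optimum.
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    print("d=1 brute force  :", brute_force_max_dispersion([(x,) for x in xs], m))
    print("d=1 prefix-suffix:", prefix_suffix_max_dispersion_1d(xs, m))

    # Two traits (d = 2): exhaustive optimum versus the outside-ball heuristic.
    pts = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
    print("d=2 brute force  :", brute_force_max_dispersion(pts, m))
    print("d=2 outside ball :", dispersion(outside_ball_subset(pts, m)))
```

Under the assumed squared-distance dispersion, the contribution of any single selected point is a convex function of its position, so in $d=1$ moving an interior selected point to a neighbouring unselected one cannot decrease the total; this is why the prefix-suffix search is expected to match the exhaustive one. The exhaustive search scales as $\binom{N}{M}$ and is only meant for small instances.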
