Private Geometric Median in Nearly-Linear Time
Syamantak Kumar, Daogao Liu, Kevin Tian, Chutong Yang
TL;DR
This work presents a nearly-linear time, differentially private algorithm for estimating the geometric median in high dimensions. By decomposing the task into radius estimation, centerpoint estimation, and a boosting phase based on stable DP-SGD, the authors achieve an $\alpha$-multiplicative approximation with sample complexity $n \gtrsim \frac{\sqrt{d}}{\alpha\varepsilon}$, which matches the information-theoretic limit up to polylogarithmic factors, and runtime $\widetilde{O}(nd + \frac{d}{\alpha^2})$. Key innovations include subsampling-inspired acceleration of the radius and centerpoint subroutines (via FriendlyCore techniques) and a customized DP-SGD sensitivity analysis tailored to the non-smooth geometric median objective. Empirical results illustrate substantial speedups from subsampling and show that the boosting approach is competitive with private baselines. Overall, the paper narrows the runtime gap between private and non-private geometric median solvers, making private robust estimation practical for large, high-dimensional datasets.
Abstract
Estimating the geometric median of a dataset is a robust counterpart to mean estimation, and is a fundamental problem in computational geometry. Recently, [HSU24] gave an $(\varepsilon, \delta)$-differentially private algorithm obtaining an $\alpha$-multiplicative approximation to the geometric median objective, $\frac 1 n \sum_{i \in [n]} \|\cdot - \mathbf{x}_i\|$, given a dataset $\mathcal{D} := \{\mathbf{x}_i\}_{i \in [n]} \subset \mathbb{R}^d$. Their algorithm requires $n \gtrsim \sqrt d \cdot \frac 1 {\alpha\varepsilon}$ samples, which they prove is information-theoretically optimal. This result is surprising because its error scales with the \emph{effective radius} of $\mathcal{D}$ (i.e., of a ball capturing most points), rather than the worst-case radius. We give an improved algorithm that obtains the same approximation quality, also using $n \gtrsim \sqrt d \cdot \frac 1 {\alpha\varepsilon}$ samples, but in time $\widetilde{O}(nd + \frac d {\alpha^2})$. Our runtime is nearly-linear, plus the cost of the cheapest non-private first-order method due to [CLM+16]. To achieve our results, we use subsampling and geometric aggregation tools inspired by FriendlyCore [TCK+22] to speed up the "warm start" component of the [HSU24] algorithm, combined with a careful custom analysis of DP-SGD's sensitivity for the geometric median objective.
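
To make the objective and the notion of effective radius concrete, the following is a minimal non-private sketch in Python with NumPy. It is not the paper's algorithm and provides no differential privacy: it uses the classical Weiszfeld iteration as a stand-in solver. The function names, the synthetic dataset, and the 0.75 quantile used as a proxy for "a ball capturing most points" are all illustrative assumptions.

```python
import numpy as np


def gm_objective(x, data):
    """Geometric median objective: average Euclidean distance from x to each row of data."""
    return np.mean(np.linalg.norm(data - x, axis=1))


def weiszfeld(data, iters=200, eps=1e-12):
    """Classical NON-private Weiszfeld iteration (illustrative stand-in, not the paper's method)."""
    x = data.mean(axis=0)  # start from the mean
    for _ in range(iters):
        # Clamp distances away from zero to avoid division by zero at a data point.
        dists = np.maximum(np.linalg.norm(data - x, axis=1), eps)
        weights = 1.0 / dists
        # Re-weighted average: nearby points get larger weight.
        x = (weights[:, None] * data).sum(axis=0) / weights.sum()
    return x


# Hypothetical dataset: 90% of points near the origin, 10% far-away outliers.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(900, 5))
outliers = rng.normal(size=(100, 5)) + 1000.0
data = np.vstack([inliers, outliers])

mean = data.mean(axis=0)
median = weiszfeld(data)

# Distances from the estimated median to every point.
dists_to_median = np.linalg.norm(data - median, axis=1)
# Proxy for the effective radius: radius of a ball holding 75% of the points
# (the quantile is an assumption made for illustration only).
effective_radius = np.quantile(dists_to_median, 0.75)
worst_case_radius = dists_to_median.max()

print("||mean||          :", np.linalg.norm(mean))
print("||median||        :", np.linalg.norm(median))
print("objective(mean)   :", gm_objective(mean, data))
print("objective(median) :", gm_objective(median, data))
print("effective radius  :", effective_radius)
print("worst-case radius :", worst_case_radius)
```

On this synthetic data the mean is dragged far toward the outliers while the geometric median stays near the bulk of the points, and the effective radius is orders of magnitude smaller than the worst-case radius. This gap is why an error guarantee that scales with the effective radius is much stronger than one that scales with the worst-case radius.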
