Private Geometric Median in Nearly-Linear Time

Syamantak Kumar, Daogao Liu, Kevin Tian, Chutong Yang

TL;DR

This work presents a near-linear time, differentially private algorithm for estimating the geometric median in high dimensions. By decomposing the task into radius estimation, centerpoint estimation, and a boosting phase based on stable DP-SGD, the authors achieve an $\alpha$-multiplicative approximation with sample complexity $n \gtrsim \frac{\sqrt{d}}{\alpha\varepsilon}$ and runtime $\widetilde{O}(nd + \frac{d}{\alpha^2})$, matching information-theoretic limits up to polylog factors. Key innovations include subsampling-inspired acceleration of radius and centerpoint subroutines (via FriendlyCore techniques) and a customized DP-SGD analysis tailored to the non-smooth geometric-median objective, enabling private optimization with limited sensitivity. Empirical results illustrate substantial speedups from subsampling and competitiveness of the boosting approach against private baselines. Overall, the paper narrows the gap between private and non-private GM solvers, delivering scalable private robust estimation suitable for large, high-dimensional datasets in practice.

Abstract

Estimating the geometric median of a dataset is a robust counterpart to mean estimation, and is a fundamental problem in computational geometry. Recently, [HSU24] gave an $(\varepsilon, \delta)$-differentially private algorithm obtaining an $\alpha$-multiplicative approximation to the geometric median objective, $\frac{1}{n} \sum_{i \in [n]} \|\cdot - \mathbf{x}_i\|$, given a dataset $\mathcal{D} := \{\mathbf{x}_i\}_{i \in [n]} \subset \mathbb{R}^d$. Their algorithm requires $n \gtrsim \sqrt{d} \cdot \frac{1}{\alpha\varepsilon}$ samples, which they prove is information-theoretically optimal. This result is surprising because its error scales with the \emph{effective radius} of $\mathcal{D}$ (i.e., of a ball capturing most points), rather than the worst-case radius. We give an improved algorithm that obtains the same approximation quality, also using $n \gtrsim \sqrt{d} \cdot \frac{1}{\alpha\varepsilon}$ samples, but in time $\widetilde{O}(nd + \frac{d}{\alpha^2})$. Our runtime is nearly-linear, plus the cost of the cheapest non-private first-order method due to [CLM+16]. To achieve our results, we use subsampling and geometric aggregation tools inspired by FriendlyCore [TCK+22] to speed up the "warm start" component of the [HSU24] algorithm, combined with a careful custom analysis of DP-SGD's sensitivity for the geometric median objective.
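
To make the objective in the abstract concrete, the following is a minimal Python sketch, not the paper's algorithm. It evaluates $\frac{1}{n} \sum_{i \in [n]} \|\mathbf{x} - \mathbf{x}_i\|$, computes a per-sample subgradient (a unit vector, so its norm is at most 1, which is the kind of bounded-gradient property a sensitivity analysis can build on), and runs a generic noisy stochastic subgradient loop. The names `gm_objective`, `gm_subgradient`, and `noisy_sgd`, and the parameters `step_size` and `noise_std`, are hypothetical and introduced only for illustration; in particular `noise_std` is not calibrated to any $(\varepsilon, \delta)$ guarantee.

```python
import math
import random


def gm_objective(points, x):
    """Average Euclidean distance from x to every point in the dataset."""
    return sum(math.dist(p, x) for p in points) / len(points)


def gm_subgradient(point, x):
    """A subgradient of ||x - point|| at x.

    This is the unit vector pointing from `point` to `x`, or the zero
    vector when the two coincide, so its norm never exceeds 1.
    """
    d = math.dist(point, x)
    if d == 0.0:
        return [0.0] * len(x)
    return [(xi - pi) / d for xi, pi in zip(x, point)]


def noisy_sgd(points, x0, steps, step_size, noise_std, seed=0):
    """Illustrative noisy stochastic subgradient descent.

    ASSUMPTION: this is a generic sketch. The noise scale is a free
    parameter and is NOT chosen to satisfy differential privacy.
    Returns the average of the iterates.
    """
    rng = random.Random(seed)
    x = list(x0)
    avg = [0.0] * len(x)
    for _ in range(steps):
        p = points[rng.randrange(len(points))]  # sample one data point
        g = gm_subgradient(p, x)
        # Step along the subgradient perturbed by Gaussian noise.
        x = [xi - step_size * (gi + rng.gauss(0.0, noise_std))
             for xi, gi in zip(x, g)]
        avg = [a + xi / steps for a, xi in zip(avg, x)]
    return avg
```

For example, on the four corners of a square, `noisy_sgd(points, [5.0, 5.0], steps=2000, step_size=0.01, noise_std=0.0)` moves from a far-away start toward the center, and `gm_objective` can be used to compare the start and the returned average iterate.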

Paper Structure

This paper contains 26 sections, 19 theorems, 54 equations, 5 figures, 1 table, 4 algorithms.

Key Result

Theorem 1

Let $\mathcal{D} = \{\mathbf{x}_i\}_{i \in [n]} \subset \mathbb{B}^d(R)$ for $R > 0$, $0 < r \le r^{(0.9)}$, and $(\alpha, \epsilon, \delta) \in [0, 1]^3$. There is an $(\epsilon, \delta)$-DP algorithm that returns $\hat{\mathbf{x}}$ such that with probability $\ge 1 - \delta$, $f_{\mathcal{D}}(\hat

Figures (5)

  • Figure 1: Comparison of $\mathsf{RadiusFinder}$ and $\mathsf{FastRadius}$ across different data distributions. Plots averaged across $100$ trials and standard deviations are reported as error bars.
  • Figure 2: Comparison of $\mathsf{DPGD}$, $\mathsf{StableDPSGD}$, and $\mathsf{FixedOrderDPSGD}$ across $\mathsf{GaussianCluster}$ data over $\mathbb{R}^{50}$, varying $n$. Plots averaged across $20$ trials and standard deviations are reported as error bars.
  • Figure 3: Comparison of $\mathsf{DPGD}$, $\mathsf{StableDPSGD}$, and $\mathsf{FixedOrderDPSGD}$ across $\mathsf{HeavyTailed}$ data over $\mathbb{R}^{50}$, varying $n$. Plots averaged across $20$ trials and standard deviations are reported as error bars.
  • Figure 4: Comparison of $\mathsf{DPGD}$, $\mathsf{StableDPSGD}$, and $\mathsf{FixedOrderDPSGD}$ across $\mathsf{GaussianCluster}$ data over $\mathbb{R}^{50}$, varying $R$. Plots averaged across $20$ trials and standard deviations are reported as error bars.
  • Figure 5: Comparison of $\mathsf{DPGD}$, $\mathsf{StableDPSGD}$, and $\mathsf{FixedOrderDPSGD}$ across $\mathsf{HeavyTailed}$ data over $\mathbb{R}^{50}$, varying $\nu$. Plots averaged across $20$ trials and standard deviations are reported as error bars.

Theorems & Definitions (37)

  • Theorem 1: informal, see Theorem \ref{thm:boost}
  • Theorem 2: informal, see Theorem \ref{thm:constant_factor}
  • Definition 1: Differential privacy
  • Lemma 1
  • proof
  • Lemma 2: Theorems 3.23 and 3.24 of [dr14]
  • Definition 2: RDP and CDP
  • Lemma 3
  • proof
  • Lemma 4: Lemma 24 of [cohen2016geometric]
  • ...and 27 more