Computing High-dimensional Confidence Sets for Arbitrary Distributions

Chao Gao, Liren Shan, Vaidehi Srinivas, Aravindan Vijayaraghavan

TL;DR

This work studies the problem of learning high-dimensional confidence sets for arbitrary distributions: it seeks minimal-volume sets in $\mathbb{R}^d$ that capture a target probability mass $\delta$ while remaining competitive with the best set from a class of bounded VC-dimension. It introduces an improper learning approach that outputs ellipsoids and achieves a substantially better volume-approximation factor against Euclidean balls than prior coreset-based methods, namely $\exp(\tilde{O}(d^{1/2}))$ in the worst case, with improved constants under near-isotropic conditions. A key technical advance is a preconditioning transformation that isotropizes the data inside the target ball, which allows non-worst-case proper-ball learning to transfer to the original space and enables a union-of-balls extension via a greedy framework. The paper also establishes hardness results for proper learning (NP-hardness and SSE-based intractability) and describes applications to conformal prediction and robust statistics, highlighting the separation between proper and improper learning in this setting. Overall, the results provide distribution-free, polynomial-time algorithms that output confidence sets with approximate volume optimality in high dimensions, with implications for uncertainty quantification and conformal prediction.

Abstract

We study the problem of learning a high-density region of an arbitrary distribution over $\mathbb{R}^d$. Given a target coverage parameter $\delta$, and sample access to an arbitrary distribution $D$, we want to output a confidence set $S \subset \mathbb{R}^d$ such that $S$ achieves $\delta$ coverage of $D$, i.e., $\mathbb{P}_{y \sim D} \left[ y \in S \right] \ge \delta$, and the volume of $S$ is as small as possible. This is a central problem in high-dimensional statistics with applications in finding confidence sets, uncertainty quantification, and support estimation. In the most general setting, this problem is statistically intractable, so we restrict our attention to competing with sets from a concept class $C$ with bounded VC-dimension. An algorithm is competitive with class $C$ if, given samples from an arbitrary distribution $D$, it outputs in polynomial time a set that achieves $\delta$ coverage of $D$, and whose volume is competitive with the smallest set in $C$ with the required coverage $\delta$. This problem is computationally challenging even in the basic setting when $C$ is the set of all Euclidean balls. Existing algorithms based on coresets find in polynomial time a ball whose volume is $\exp(\tilde{O}(d/\log d))$-factor competitive with the volume of the best ball. Our main result is an algorithm that finds a confidence set whose volume is $\exp(\tilde{O}(d^{1/2}))$-factor competitive with the optimal ball having the desired coverage. The algorithm is improper (it outputs an ellipsoid). Combined with our computational intractability result for proper learning balls within an $\exp(\tilde{O}(d^{1-o(1)}))$ approximation factor in volume, our results provide an interesting separation between proper and (improper) learning of confidence sets.
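
To make the two quantities in the problem statement concrete, here is a minimal Python sketch; it is not the paper's algorithm. It computes the empirical coverage of a ball on a sample, the log-volume of a $d$-dimensional ball, and the radius factor that corresponds to a given volume factor. The function names, the fixed-center radius search, and the use of exactly $\exp(\sqrt{d})$ as the volume factor (ignoring the constants and logarithmic factors hidden by $\tilde{O}$) are assumptions made for illustration.

```python
import math
import numpy as np


def log_ball_volume(d, radius):
    """Log-volume of a d-dimensional Euclidean ball: pi^(d/2) * r^d / Gamma(d/2 + 1)."""
    return 0.5 * d * math.log(math.pi) + d * math.log(radius) - math.lgamma(0.5 * d + 1.0)


def empirical_coverage(points, center, radius):
    """Fraction of sample points that fall inside the ball B(center, radius)."""
    dists = np.linalg.norm(points - center, axis=1)
    return float(np.mean(dists <= radius))


def radius_for_coverage(points, center, delta):
    """Smallest radius around a FIXED center that covers at least a delta fraction.

    Assumption for illustration only: the hard part of the problem is choosing
    the center (and shape) of the set; this helper does not attempt that.
    """
    dists = np.sort(np.linalg.norm(points - center, axis=1))
    k = math.ceil(delta * len(dists))  # number of points that must be covered
    return float(dists[k - 1])


def radius_factor(log_volume_factor, d):
    """For balls in R^d, a volume factor exp(L) equals a radius factor exp(L / d)."""
    return math.exp(log_volume_factor / d)


rng = np.random.default_rng(0)
d = 100
sample = rng.standard_normal((2000, d))
r = radius_for_coverage(sample, np.zeros(d), delta=0.9)
print("coverage:", empirical_coverage(sample, np.zeros(d), r))
print("log-volume:", log_ball_volume(d, r))
# Illustrative only: treat the volume factor as exactly exp(sqrt(d)).
print("radius factor:", radius_factor(math.sqrt(d), d))
```

Because the volume of a ball scales as the $d$-th power of its radius, a volume factor of $\exp(\sqrt{d})$ between two balls corresponds to a radius factor of only $\exp(1/\sqrt{d})$.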

Paper Structure

This paper contains 26 sections, 22 theorems, 101 equations, 6 figures.

Key Result

Theorem 1.1

There is a polynomial time algorithm that for any target coverage $\delta \in (0, 1)$ and coverage slack $\gamma\in(0,1)$, when given $n=\Omega(d^2/\gamma^2)$ samples drawn i.i.d. from an arbitrary distribution $\mathcal{D}$, finds with high probability a set $S \subset \mathbb{R}^d$ that is $\Gamma and where $B^\star$ is the minimum volume ball that achieves at least $\delta + \gamma + O(\sqrt{d

Figures (6)

  • Figure 1: The figure shows the points $Y$ (in red) with mean $\mu$, that are contained in ball $B$ with center $c$ and radius $R$. It is not necessarily the case that $\mu$ is near $c$, as $B$ can be defined by just a few points. However, Chebyshev's inequality tells us that most of the points $Y$ (depicted here as the points in the shaded region) are within a few standard deviations ($\sigma$) of $\mu$ in the $\mu - c$ direction. This allows us to bound the distance of these points to $\mu$ as $\le \sqrt{R^2 + (t\sigma)^2}$.
  • Figure 2: The figure shows a ball $B^{\star}$ of radius $R^{\star}=\sqrt{d}$ containing a set of points whose covariance matrix $\Sigma_{Y^{\star}}$ is not isotropic and has some directions of high variance. In this case, we can try to find a smaller ellipsoid containing many of the points.
  • Figure 3: Algorithm $\textsc{Dense\_Ball}$ for finding a small volume ball that contains at least $\delta' = \delta(1-\gamma)$ fraction of points
  • Figure 4: Algorithm $\textsc{Dense\_Ball\_Isotropic}$ for finding a small volume ball that contains at least $\delta' = \delta(1-\gamma)$ fraction of points for isotropic distributions
  • Figure 5: Algorithm $\textsc{Dense\_Ellipsoid}$ for finding a small volume ellipsoid that contains at least $\delta' = \delta(1-\gamma)$ fraction of points
  • ...and 1 more figure
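
The bound in the Figure 1 caption can be checked numerically. Write $u = (\mu - c)/\|\mu - c\|$ and split $y - \mu$ into its component along $u$ and its component orthogonal to $u$. Since $\mu - c$ is parallel to $u$, the orthogonal component of $y - \mu$ equals that of $y - c$, so it has norm at most $R$; the component along $u$ is at most $t\sigma$ for the points selected by Chebyshev's inequality. The Python sketch below illustrates this under stated assumptions and is not code from the paper: the synthetic data, the function name `chebyshev_check`, and the choice $t = 2$ are all hypothetical.

```python
import numpy as np


def chebyshev_check(points, center, radius, t):
    """Return (fraction within t*sigma along mu - c, max distance to mu, bound)."""
    mu = points.mean(axis=0)
    # Unit vector in the mu - c direction (assumes mu != center).
    u = (mu - center) / np.linalg.norm(mu - center)
    proj = (points - mu) @ u             # signed offsets from mu along u
    sigma = proj.std()                   # empirical standard deviation (ddof=0)
    near = np.abs(proj) <= t * sigma     # points within t standard deviations
    bound = np.sqrt(radius**2 + (t * sigma) ** 2)
    max_dist = np.linalg.norm(points[near] - mu, axis=1).max()
    return float(near.mean()), float(max_dist), float(bound)


# Synthetic data (an assumption): a skewed cloud forced inside the ball B(c, R).
rng = np.random.default_rng(1)
d, R, t = 20, 1.0, 2.0
center = np.zeros(d)
raw = 0.2 * rng.standard_normal((5000, d))
raw[:, 0] += 0.6  # shift the cloud so the mean sits away from the center
norms = np.linalg.norm(raw, axis=1, keepdims=True)
points = raw * np.minimum(1.0, R / norms)  # pull outside points back onto the ball

frac, max_dist, bound = chebyshev_check(points, center, R, t)
print("fraction within t*sigma:", frac, "(Chebyshev guarantees at least", 1 - 1 / t**2, ")")
print("max distance to mean:", max_dist, "bound:", bound)
```

Chebyshev's inequality applied to the empirical distribution guarantees that at least a $1 - 1/t^2$ fraction of the points pass the `near` filter, and every such point lies within $\sqrt{R^2 + (t\sigma)^2}$ of $\mu$.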

Theorems & Definitions (43)

  • Theorem 1.1: Learning confidence sets competitive with Euclidean balls
  • Theorem 1.2: Union of Balls
  • Theorem 1.3: NP-hardness of Proper Learning for Balls
  • Theorem 1.4: Bounded variance implies better bounds
  • Theorem 1.5: Conformal Prediction with Approximate Volume Optimality
  • Lemma 2.0: For bounded variance, most points are near the mean
  • proof
  • Theorem 2.1: Finding small volume ellipsoid for $n$ points
  • Theorem 3.1: Finding approximate minimum volume $\delta$-coverage ball
  • Corollary 3.2: Population version of \ref{thm:main:ball}
  • ...and 33 more