Optimal bounds for $\ell_p$ sensitivity sampling via $\ell_2$ augmentation

Alexander Munteanu, Simon Omlor

TL;DR

The paper tackles the problem of constructing accurate $\ell_p$ subspace embeddings via sensitivity sampling. By introducing an $\ell_2$ augmentation, i.e., sampling probabilities that combine $\ell_p$ and $\ell_2$ leverage scores, the authors obtain a sampling complexity of $\tilde{O}(\varepsilon^{-2}(\mathfrak S+d))$ for all $p\in[1,2]$, which is linear in $d$ up to polylogarithmic factors. This resolves an open question and matches lower bounds up to polylogarithmic factors. They prove a tight lower bound for sampling with plain $\ell_p$ leverage scores and provide a general framework that handles weighted norms as well as the $p$-ReLU and logistic losses, yielding a fully linear $\tilde{O}(\varepsilon^{-2}\mu d)$ bound for logistic regression. The proof of the main result combines Gaussian-process-based error bounds, diameter and entropy control, and weighted covering arguments, with practical implications for efficient, scalable subsampling in regression and related problems. Overall, the work tightens the theoretical understanding of sensitivity sampling and broadens the regime where simple sensitivity-based subsampling matches the best known bounds from Lewis weights.

Abstract

Data subsampling is one of the most natural methods to approximate a massively large data set by a small representative proxy. In particular, sensitivity sampling, which samples points proportional to an individual importance measure called sensitivity, has received a lot of attention. This framework reduces, in very general settings, the size of the data to roughly the VC dimension $d$ times the total sensitivity $\mathfrak S$ while providing strong $(1\pm\varepsilon)$ guarantees on the quality of approximation. The recent work of Woodruff & Yasuda (2023c) improved substantially over the general $\tilde O(\varepsilon^{-2}\mathfrak Sd)$ bound for the important problem of $\ell_p$ subspace embeddings to $\tilde O(\varepsilon^{-2}\mathfrak S^{2/p})$ for $p\in[1,2]$. Their result was subsumed by an earlier $\tilde O(\varepsilon^{-2}\mathfrak Sd^{1-p/2})$ bound which was implicitly given in the work of Chen & Derezinski (2021). We show that their result is tight when sampling according to plain $\ell_p$ sensitivities. We observe that by augmenting the $\ell_p$ sensitivities by $\ell_2$ sensitivities, we obtain better bounds improving over the aforementioned results to optimal linear $\tilde O(\varepsilon^{-2}(\mathfrak S+d)) = \tilde O(\varepsilon^{-2}d)$ sampling complexity for all $p \in [1,2]$. In particular, this resolves an open question of Woodruff & Yasuda (2023c) in the affirmative for $p \in [1,2]$ and brings sensitivity subsampling into the regime that was previously only known to be possible using Lewis weights (Cohen & Peng, 2015). As an application of our main result, we also obtain an $\tilde O(\varepsilon^{-2}\mu d)$ sensitivity sampling bound for logistic regression, where $\mu$ is a natural complexity measure for this problem. This improves over the previous $\tilde O(\varepsilon^{-2}\mu^2 d)$ bound of Mai et al. (2021) which was based on Lewis weights subsampling.
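
The augmentation idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the $\ell_p$ and $\ell_2$ scores are combined by a plain sum, that row $i$ is kept independently with probability $p_i=\min\{1, k\,(l_i^{(p)}+l_i^{(2)})\}$, and that kept rows are rescaled by $p_i^{-1/p}$ so that $\|SAx\|_p^p$ is an unbiased estimate of $\|Ax\|_p^p$. The function names `l2_leverage_scores` and `augmented_sample`, the oversampling parameter `k`, and the input `lp_scores` (the $\ell_p$ sensitivities, which the caller must supply) are hypothetical.

```python
import numpy as np


def l2_leverage_scores(A):
    """l2 leverage scores: squared row norms of an orthonormal basis of col(A)."""
    A = np.asarray(A, dtype=float)
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    # Discard directions with numerically zero singular values.
    tol = s.max() * max(A.shape) * np.finfo(float).eps
    rank = int(np.sum(s > tol))
    return np.sum(U[:, :rank] ** 2, axis=1)


def augmented_sample(A, lp_scores, p, k, rng=None):
    """Keep row i with probability min(1, k * (lp_scores[i] + l2 score of row i)).

    Assumption: the two scores are combined by a plain sum. Kept rows are
    rescaled by probs[i] ** (-1 / p), which makes ||SAx||_p^p unbiased.
    Returns the reweighted sample, the boolean keep mask, and the probabilities.
    """
    rng = np.random.default_rng(rng)
    A = np.asarray(A, dtype=float)
    combined = np.asarray(lp_scores, dtype=float) + l2_leverage_scores(A)
    probs = np.minimum(1.0, k * combined)
    keep = rng.random(A.shape[0]) < probs
    weights = probs[keep] ** (-1.0 / p)
    return weights[:, None] * A[keep], keep, probs


# Usage with p = 2, where the sensitivities coincide with the l2 leverage scores.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5))
scores = l2_leverage_scores(A)
SA, keep, probs = augmented_sample(A, scores, p=2, k=20, rng=rng)
x = rng.standard_normal(5)
# Number of kept rows and the ratio ||SAx||_2^2 / ||Ax||_2^2.
print(keep.sum(), np.linalg.norm(SA @ x) ** 2 / np.linalg.norm(A @ x) ** 2)
```

For $p\neq 2$ the sketch leaves the computation of the $\ell_p$ sensitivities to the caller; only the $\ell_2$ part is computed here, via a singular value decomposition.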

Paper Structure (29 sections, 25 theorems, 168 equations, 1 figure)

Key Result

Theorem 1.3

There exists a matrix $A\in\mathbb{R}^{m\times 2d}$, for sufficiently large $m\gg 2d$, such that if we sample each row $i \in [m]$ with probability $p_i:= \min \{1, k l_i^{(p)}\}$ for some $k \in \mathbb{N}$, then with high probability, the $\ell_p$ subspace embedding guarantee (see lp_approx_guaran

Figures (1)

  • Figure 1: Leading dependence on $d$ for $\ell_p$ sensitivity sampling for $p\in[1,2]$ in the worst case, i.e., when $\mathfrak S^{(p)}=d$. The horizontal axis represents $p$. The vertical axis indicates the exponent on $d$ in the respective sample complexity results. The red line indicates the standard bounds obtained from a plain application of the sensitivity framework (Feldman et al., 2020), blue indicates the result of Woodruff & Yasuda (2023c), yellow indicates the result of Chen & Derezinski (2021), and green indicates our new main result.
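
As a worked check of the curves described in the caption, the exponents follow from substituting $\mathfrak S = d$ into the bounds quoted in the abstract. The assignment of bounds to colors is taken from the caption; polylogarithmic factors and the $\varepsilon^{-2}$ factor are ignored.

$$
\begin{aligned}
\text{red (general framework):}\quad & \mathfrak S\, d = d^{2} && \Rightarrow\ \text{exponent } 2,\\
\text{blue (Woodruff \& Yasuda):}\quad & \mathfrak S^{2/p} = d^{2/p} && \Rightarrow\ \text{exponent } 2/p,\\
\text{yellow (Chen \& Derezinski):}\quad & \mathfrak S\, d^{1-p/2} = d^{2-p/2} && \Rightarrow\ \text{exponent } 2 - p/2,\\
\text{green (main result):}\quad & \mathfrak S + d = 2d && \Rightarrow\ \text{exponent } 1.
\end{aligned}
$$

The yellow curve never lies above the blue one on $[1,2]$, since $2 - p/2 \le 2/p$ is equivalent to $(p-2)^2 \ge 0$. At $p=1$ the four exponents are $2$, $2$, $3/2$, and $1$. At $p=2$ the blue, yellow, and green curves all meet at exponent $1$.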

Theorems & Definitions (52)

  • Definition 1.1: $\ell_p$-sensitivities/-leverage scores
  • Theorem 1.3: Informal restatement of the theorem labeled thm: lowerbound
  • Definition 1.4: $\mu$-complexity, Munteanu et al. (2018); Munteanu et al. (2022), slightly modified
  • Theorem 1.5: Informal restatement of the theorem labeled thm:samplingthm
  • Theorem 1.6: Informal restatement of the theorem labeled thm:logistic
  • Definition 2.1
  • Theorem B.1
  • proof
  • Definition C.1: Lévy mean
  • Theorem C.2: Dual Sudakov minoration, Proposition 4.2 of Bourgain et al. (1989)
  • ...and 42 more