Data subsampling for Poisson regression with pth-root-link

Han Cheng Lie, Alexander Munteanu

TL;DR

This work addresses data reduction for Poisson regression with a $p$th-root-link by developing sublinear coresets that exist when a data-dependent complexity parameter $\rho$ is small. It combines sensitivity sampling, VC-dimension analysis, and a novel domain-shifting technique to achieve $(1\pm\varepsilon)$-approximation guarantees on a shifted domain $D(\eta)$, with explicit coreset size bounds that depend on $d$, $\varepsilon$, $n$, and $y_{\max}$. The authors derive tight bounds for $p=1$ and $p=2$ (including a Lambert $W_0$-based analysis) and prove $\Omega(n)$ lower bounds that apply to coresets and, up to logarithmic factors, to arbitrary data reduction techniques. They also show that these results cannot be extended to $p\ge 3$ with the same approach. The paper further discusses extreme-point storage and smoothed convex-hull complexity, and reports experiments illustrating practical gains over uniform subsampling. Overall, it provides a principled framework and concrete limits for sublinear data summaries in Poisson GLMs, with implications for scalable inference on count data.

Abstract

We develop and analyze data subsampling techniques for Poisson regression, the standard model for count data $y\in\mathbb{N}$. In particular, we consider the Poisson generalized linear model with ID- and square root-link functions. We consider the method of coresets, which are small weighted subsets that approximate the loss function of Poisson regression up to a factor of $1\pm\varepsilon$. We show $\Omega(n)$ lower bounds against coresets for Poisson regression that continue to hold against arbitrary data reduction techniques up to logarithmic factors. By introducing a novel complexity parameter and a domain shifting approach, we show that sublinear coresets with a $1\pm\varepsilon$ approximation guarantee exist when the complexity parameter is small. In particular, the dependence on the number of input points can be reduced to polylogarithmic. We show that the dependence on other input parameters can also be bounded sublinearly, though not always logarithmically. In particular, we show that the square root-link admits an $O(\log(y_{\max}))$ dependence, where $y_{\max}$ denotes the largest count present in the data, while the ID-link requires a $\Theta(\sqrt{y_{\max}/\log(y_{\max})})$ dependence. As an auxiliary result for proving the tightness of the bound with respect to $y_{\max}$ in the case of the ID-link, we show an improved bound on the principal branch of the Lambert $W_0$ function, which may be of independent interest. We further show the limitations of our analysis when $p$th degree root-link functions for $p\geq 3$ are considered, which indicate that other analytical or computational methods would be required if such a generalization is even possible.
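To make the objects in the abstract concrete, the following is a minimal sketch of the standard Poisson negative log-likelihood with mean $(x_i\beta)^p$, evaluated on the full data and on a weighted subset. It is an illustration under stated assumptions, not the authors' implementation: the function names, the synthetic data, and the inclusion of the $\log(y_i!)$ term are choices made here, and the paper's exact normalization of the loss may differ. The subset is drawn uniformly, which is the baseline the paper compares against; the sensitivity-based sampling of the paper is not reproduced.

```python
import math
import random


def poisson_root_link_loss(X, y, beta, p, weights=None):
    """Weighted Poisson negative log-likelihood with mean (x . beta)^p.

    Returns math.inf when some linear predictor x . beta is <= 0,
    i.e. when beta lies outside the domain of the loss.
    """
    if weights is None:
        weights = [1.0] * len(X)
    total = 0.0
    for x_i, y_i, w_i in zip(X, y, weights):
        z = sum(a * b for a, b in zip(x_i, beta))  # linear predictor
        if z <= 0.0:
            return math.inf  # infeasible parameter for this data point
        # -log Poisson(y | z^p) = z^p - p * y * log(z) + log(y!)
        total += w_i * (z ** p - p * y_i * math.log(z) + math.lgamma(y_i + 1))
    return total


def uniform_subsample(X, y, k, rng):
    """Uniform baseline: k distinct points, each reweighted by n / k."""
    n = len(X)
    idx = rng.sample(range(n), k)
    return [X[i] for i in idx], [y[i] for i in idx], [n / k] * k


# Hypothetical synthetic data (all values below are assumptions).
rng = random.Random(0)
p = 2  # square root-link
beta = [0.8, 1.1, 0.5]
X = [[rng.uniform(0.5, 1.5) for _ in beta] for _ in range(2000)]
# Counts near the model mean via random rounding (not a true Poisson draw).
y = [int(sum(a * b for a, b in zip(x, beta)) ** p + rng.random()) for x in X]

full = poisson_root_link_loss(X, y, beta, p)
Xs, ys, ws = uniform_subsample(X, y, 200, rng)
approx = poisson_root_link_loss(Xs, ys, beta, p, ws)
print("weighted subset loss / full loss:", approx / full)
```

A coreset in the sense of the abstract would require this ratio to lie in $[1-\varepsilon, 1+\varepsilon]$ for every admissible $\beta$ simultaneously, not just for the single $\beta$ evaluated here. The `math.inf` branch marks parameters outside the domain of the loss, where some linear predictor is non-positive.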

Paper Structure

This paper contains 23 sections, 36 theorems, 137 equations, 1 figure, 2 algorithms.

Key Result

Lemma 2.0

It holds for all $z\in\mathbb{R}_{> 0}, p\in[1,\infty), y\in\mathbb{N}$ that

Figures (1)

  • Figure 1: Experimental results for two synthetic data sets with $p=1$ (left) and $p=2$ (right). Our method is presented in red and compared against uniform sampling, which is presented in blue. Solid lines indicate the median and shaded areas indicate $\pm 2$ standard errors around the median, taken across $201$ independent repetitions for each reduced size between $50$ and $600$ in steps of $50$. For the blue shaded area below the blue solid line, only feasible repetitions were counted, while the blue shaded area above represents the unbounded standard error without this restriction. For some lower reduced sizes, even the median was infinite, which results in an interrupted blue solid line. This indicates that more than half of the repetitions gave infeasible results when using uniform sampling with low sample sizes, while our method never produced infeasible results.
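The caption's remark about an infinite median can be illustrated with a small sketch. It assumes that each repetition yields one quality value and that an infeasible repetition is recorded as `math.inf`; the helper name `summarize` and the sample values are hypothetical and are not taken from the paper's experiments.

```python
import math
import statistics


def summarize(values):
    """Median over all repetitions, standard error over feasible ones only."""
    feasible = [v for v in values if math.isfinite(v)]
    median = statistics.median(values)  # infinite if most runs are infeasible
    if len(feasible) >= 2:
        std_err = statistics.stdev(feasible) / math.sqrt(len(feasible))
    else:
        std_err = math.nan  # not enough feasible runs to estimate spread
    return median, std_err, len(feasible)


# Hypothetical repetitions: infeasible runs are recorded as infinity.
mostly_feasible = [1.0, 1.1, 1.2, math.inf, math.inf]
mostly_infeasible = [1.0, 1.2, math.inf, math.inf, math.inf]

print(summarize(mostly_feasible))
print(summarize(mostly_infeasible))
```

With an odd number of repetitions, such as the $201$ used in the figure, the median is the middle order statistic, so it is infinite exactly when more than half of the repetitions are infeasible.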

Theorems & Definitions (57)

  • Lemma 2.0
  • Lemma 2.0
  • Lemma 2.0
  • Lemma 3.0
  • Lemma 3.0
  • Lemma 3.0
  • Lemma 3.0
  • Lemma 3.0
  • Corollary 3.1
  • Lemma 3.1
  • ...and 47 more