A Provably Accurate Randomized Sampling Algorithm for Logistic Regression

Agniva Chowdhury, Pradeep Ramuhalli

TL;DR

This work addresses scalable logistic regression in the $n\gg d$ regime by introducing a leverage-score based sketching approach that subsamples $s=\mathcal{O}(d/\varepsilon^{2})$ observations to form a diagonal sketching matrix $\mathbf{S}$. The subsampled log-likelihood $\bar{\ell}(\boldsymbol{\beta})$ yields an estimator $\hat{\boldsymbol{\beta}}$ whose estimated probabilities satisfy $\|\mathbf{p}(\hat{\boldsymbol{\beta}})-\mathbf{p}(\boldsymbol{\beta}^{*})\|_{2} \le \varepsilon\,\|\mathbf{y}-\mathbf{p}(\boldsymbol{\beta}^{*})\|_{2}$ with high probability, under two structural conditions that reduce to randomized matrix multiplication. Here $\mathbf{p}(\boldsymbol{\beta}^{*})$ denotes the estimated probabilities from the full-data problem. The key contributions are a tight error bound on the estimated probabilities, a sampling complexity that is independent of the data-dependent complexity measure $\mu_{\mathbf{y}}(\mathbf{X})$, and the use of standard leverage scores. These are accompanied by empirical validation on real datasets showing performance competitive with the full-data solution and with prior coreset methods. The method offers a practical, provably accurate, and computationally efficient solution for large-scale binary classification, and it advances sketching-based approaches for logistic regression by delivering finite-sample guarantees with simple leverage-score sampling.

Abstract

In statistics and machine learning, logistic regression is a widely used supervised learning technique primarily employed for binary classification tasks. When the number of observations greatly exceeds the number of predictor variables, we present a simple, randomized sampling-based algorithm for the logistic regression problem that guarantees high-quality approximations to both the estimated probabilities and the overall discrepancy of the model. Our analysis builds upon two simple structural conditions that boil down to randomized matrix multiplication, a fundamental and well-understood primitive of randomized numerical linear algebra. We analyze the properties of the estimated probabilities of logistic regression when leverage scores are used to sample observations, and prove that accurate approximations can be achieved with a sample whose size is much smaller than the total number of observations. To further validate our theoretical findings, we conduct comprehensive empirical evaluations. Overall, our work sheds light on the potential of using randomized sampling approaches to efficiently approximate the estimated probabilities in logistic regression, offering a practical and computationally efficient solution for large-scale datasets.
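
To make the sampling idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that rows are drawn with replacement with probability proportional to their leverage scores and rescaled by $1/(s\pi_i)$, as is customary in randomized matrix multiplication; the paper's exact construction of $\mathbf{S}$ may differ. The function names (`leverage_scores`, `leverage_sample`, `fit_logistic`), the plain Newton solver, the small damping term, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def leverage_scores(X):
    # Row leverage scores: squared row norms of an orthonormal basis for range(X).
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U**2, axis=1)

def leverage_sample(X, y, s, rng):
    # Draw s rows with replacement, with probability proportional to leverage.
    lev = leverage_scores(X)
    probs = lev / lev.sum()
    idx = rng.choice(X.shape[0], size=s, replace=True, p=probs)
    # Assumed rescaling: weight 1/(s * pi_i) for each sampled row.
    weights = 1.0 / (s * probs[idx])
    return X[idx], y[idx], weights

def sigmoid(z):
    # Clip the argument to avoid overflow in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def fit_logistic(X, y, weights=None, iters=50, damping=1e-8):
    # Newton's method on the (optionally weighted) log-likelihood.
    n, d = X.shape
    w = np.ones(n) if weights is None else weights
    beta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ beta)
        grad = X.T @ (w * (y - p))
        # Tiny damping on the Hessian for numerical stability (an assumption).
        H = (X * (w * p * (1.0 - p))[:, None]).T @ X + damping * np.eye(d)
        step = np.linalg.solve(H, grad)
        beta = beta + step
        if np.linalg.norm(step) < 1e-10:
            break
    return beta

# Synthetic example with n >> d (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
n, d, s = 5000, 5, 500
X = rng.standard_normal((n, d))
beta_true = rng.standard_normal(d)
y = (rng.random(n) < sigmoid(X @ beta_true)).astype(float)

beta_full = fit_logistic(X, y)             # full-data estimate
Xs, ys, ws = leverage_sample(X, y, s, rng)
beta_hat = fit_logistic(Xs, ys, ws)        # estimate from the subsample

# Relative error in estimated probabilities, matching the form of the bound above.
p_full, p_hat = sigmoid(X @ beta_full), sigmoid(X @ beta_hat)
rel_err = np.linalg.norm(p_hat - p_full) / np.linalg.norm(y - p_full)
print(f"relative error in probabilities: {rel_err:.3f}")
```

The final quantity `rel_err` is the ratio $\|\mathbf{p}(\hat{\boldsymbol{\beta}})-\mathbf{p}(\boldsymbol{\beta}^{*})\|_{2} / \|\mathbf{y}-\mathbf{p}(\boldsymbol{\beta}^{*})\|_{2}$, so it can be compared directly against a target $\varepsilon$ when experimenting with the sample size `s`.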

Paper Structure

This paper contains 15 sections, 7 theorems, 33 equations, 3 figures, 2 algorithms.

Key Result

Theorem 1

Let $\mathbf{X}\in\mathbb{R}^{n \times d}$ and $\mathbf{y}\in\{0,1\}^n$ be the inputs of the logistic regression problem. Assume that for some constant $0<\varepsilon< 1$, the sketching matrix $\mathbf{S} \in \mathbb{R}^{s\times n}$ satisfies the structural conditions of eqns. (eq:cond1) and eq:cond Recall that $\mathbf{p}(\boldsymbol{\beta}^*)$ is the vector of estimated probabilities from the fu

Figures (3)

  • Figure 1: Experiment results on real data: The top row of plots illustrates the relative errors in estimated probabilities and the bottom row shows misclassification rates. Errors are in log-scale.
  • Figure 2: Relative error in the full-data negative log-likelihood (top row) and in the subsampled negative log-likelihood (bottom row) for all three datasets. Errors are in log-scale.
  • Figure 3: Standard deviations for all the metrics in Figures 1 and 2. All the errors are in log-scale.

Theorems & Definitions (9)

  • Theorem 1
  • Remark 2
  • Corollary 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • Remark 8
  • Lemma A1