Near-Polynomially Competitive Active Logistic Regression

Yihan Zhou, Eric Price, Trung Nguyen

TL;DR

This work tackles active learning for logistic regression in the realizable probabilistic setting, aiming to minimize label queries while preserving accuracy. The authors develop a multiplicative-weights–based algorithm enhanced with double querying, a sampling scheme, and clipping, plus dimension-reduction techniques, to achieve a label complexity that is polynomially competitive with the optimal up to polylogarithmic factors. They prove a final bound of the form $O\big(d^2 m^{20} \log^{26}(1/\gamma) \log^2(dR_1R_2/\varepsilon)\big)$, and show that with clipping and dimension reduction, the approach remains robust and scalable; they also demonstrate practical gains via experiments on synthetic and Musk datasets. The results indicate that near-polynomially competitive active logistic regression can substantially outperform passive strategies in real-world settings and extend to broader probabilistic binary classifiers, including exponential-family models.
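To make the multiplicative-weights idea concrete, the following is a minimal sketch of a generic likelihood-based weight update over a finite pool of candidate logistic models. It is not the authors' algorithm: the finite pool `hypotheses`, the greedy "query the point of maximum weighted disagreement" rule, the `clip` value, and the function name `mw_active_logistic` are all assumptions made for illustration, and the paper's sampling scheme, double querying, and dimension reduction are not reproduced here.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def mw_active_logistic(X, hypotheses, query_label, n_queries, clip=1e-3):
    """Illustrative multiplicative weight update over candidate weight vectors.

    X           : (n, d) array of unlabeled points
    hypotheses  : (k, d) array of candidate logistic weight vectors (assumed finite)
    query_label : callable i -> {0, 1}, returns a (noisy) label for X[i]
    """
    # probs[j, i] = P(y = 1 | x_i) under hypothesis j, clipped away from 0 and 1
    # so that a single surprising label cannot zero out a hypothesis.
    probs = np.clip(sigmoid(hypotheses @ X.T), clip, 1.0 - clip)
    log_w = np.zeros(len(hypotheses))  # uniform starting weights, kept in log space
    for _ in range(n_queries):
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        mean = w @ probs
        # Weighted variance of the predicted probabilities at each point:
        # large where the currently plausible hypotheses disagree.
        disagreement = w @ (probs - mean) ** 2
        i = int(np.argmax(disagreement))
        y = query_label(i)
        # Multiply each weight by the likelihood it assigns to the observed label.
        log_w += np.log(probs[:, i] if y == 1 else 1.0 - probs[:, i])
    return hypotheses[int(np.argmax(log_w))]


# Toy usage on synthetic data; the true model is one of the candidates.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
hypotheses = rng.normal(size=(50, 3))
w_star = hypotheses[7]


def query_label(i):
    return int(rng.random() < sigmoid(X[i] @ w_star))


w_hat = mw_active_logistic(X, hypotheses, query_label, n_queries=100)
```

The same point may be queried more than once, which is harmless in the probabilistic setting because each query returns a fresh noisy label.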

Abstract

We address the problem of active logistic regression in the realizable setting. It is well known that active learning can require exponentially fewer label queries compared to passive learning, in some cases using $\log \frac{1}{\varepsilon}$ rather than $\mathrm{poly}(1/\varepsilon)$ labels to get error $\varepsilon$ larger than the optimum. We present the first algorithm that is polynomially competitive with the optimal algorithm on every input instance, up to factors polylogarithmic in the error and domain size. In particular, if any algorithm achieves label complexity polylogarithmic in $\varepsilon$, so does ours. Our algorithm is based on efficient sampling and can be extended to learn more general classes of functions. We further support our theoretical results with experiments demonstrating performance gains for logistic regression compared to existing active learning algorithms.

Paper Structure

This paper contains 53 sections, 23 theorems, 132 equations, 2 figures, 1 table.

Key Result

Theorem 1.5

Under the boundedness assumption, the main algorithm returns a hypothesis $\hat{h}$ such that $\mathrm{err}(\hat{h}) \leq 17 \varepsilon$ with probability at least $0.7$, using a label complexity of

Figures (2)

  • Figure 1: Comparison of OURS with PASS, LSS and ACED on (a) a 100-dimensional synthetic dataset and (b) the Musk dataset
  • Figure 2: Comparison of OURS with PASS, LSS and ACED in terms of the weighted $\ell_2$-distances between estimated hypotheses and the ground truth hypothesis on the synthetic dataset

Theorems & Definitions (50)

  • Definition 1.1: Realizable Active Probabilistic Classification
  • Definition 1.2: Optimal Label Complexity
  • Example 1.4
  • Theorem 1.5
  • Lemma 3.1: Lower Bound for Non-concentrated Distribution
  • Lemma 3.2
  • Lemma 4.1
  • Lemma 5.2
  • Lemma 5.3
  • Lemma 5.4
  • ...and 40 more