Near-Polynomially Competitive Active Logistic Regression
Yihan Zhou, Eric Price, Trung Nguyen
TL;DR
This work tackles active learning for logistic regression in the realizable probabilistic setting, aiming to minimize label queries while preserving accuracy. The authors develop a multiplicative-weights–based algorithm enhanced with double querying, a sampling scheme, and clipping, plus dimension-reduction techniques, to achieve a label complexity that is polynomially competitive with that of the optimal algorithm, up to polylogarithmic factors. They prove a final bound of the form $O\big(d^2 m^{20} \log^{26}(1/\gamma) \log^2(dR_1R_2/\varepsilon)\big)$ and show that, with clipping and dimension reduction, the approach remains robust and scalable. They also demonstrate practical gains in experiments on synthetic and Musk datasets. The results indicate that near-polynomially competitive active logistic regression can substantially outperform passive strategies in real-world settings and can extend to broader probabilistic binary classifiers, including exponential-family models.
Abstract
We address the problem of active logistic regression in the realizable setting. It is well known that active learning can require exponentially fewer label queries than passive learning, in some cases using $\log \frac{1}{\varepsilon}$ rather than $\operatorname{poly}(1/\varepsilon)$ labels to get error $\varepsilon$ larger than the optimum. We present the first algorithm that is polynomially competitive with the optimal algorithm on every input instance, up to factors polylogarithmic in the error and domain size. In particular, if any algorithm achieves label complexity polylogarithmic in $\varepsilon$, so does ours. Our algorithm is based on efficient sampling and can be extended to learn a more general class of functions. We further support our theoretical results with experiments demonstrating performance gains for logistic regression compared to existing active learning algorithms.
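As a rough illustration of the multiplicative-weights idea named in the TL;DR, the sketch below maintains weights over a small finite set of candidate parameter vectors and multiplies each weight by the likelihood that candidate assigns to an observed label under a logistic model. This is not the authors' algorithm: the candidate set, the names `mw_update` and `run_demo`, and the choice to draw query points uniformly at random (rather than selecting them actively) are assumptions made for illustration, and double querying, clipping, and dimension reduction are not reproduced.

```python
import math
import random


def sigmoid(z):
    """Logistic link: maps a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def mw_update(weights, hypotheses, x, y):
    """One multiplicative-weights step.

    Each hypothesis h predicts P(y = 1 | x) = sigmoid(<h, x>). Its weight is
    multiplied by the probability it assigns to the observed label y, and the
    weights are then renormalized to sum to 1.
    """
    updated = []
    for w, h in zip(weights, hypotheses):
        p = sigmoid(sum(hi * xi for hi, xi in zip(h, x)))
        updated.append(w * (p if y == 1 else 1.0 - p))
    total = sum(updated)
    return [v / total for v in updated]


def run_demo(seed=0, n_queries=200):
    """Toy run in the realizable setting: labels come from one of the candidates.

    Query points are drawn uniformly at random here (an assumption); an active
    learner would instead choose which points to query.
    """
    rng = random.Random(seed)
    hypotheses = [(-2.0,), (-1.0,), (0.0,), (1.0,), (2.0,)]  # hypothetical candidates
    true_w = (2.0,)  # hypothetical ground truth, contained in the candidate set
    weights = [1.0 / len(hypotheses)] * len(hypotheses)
    for _ in range(n_queries):
        x = (rng.uniform(-1.0, 1.0),)
        p_true = sigmoid(true_w[0] * x[0])
        y = 1 if rng.random() < p_true else 0  # noisy label from the true model
        weights = mw_update(weights, hypotheses, x, y)
    return weights


if __name__ == "__main__":
    print(run_demo())
```

With a uniform prior, this update coincides with a Bayesian posterior over the candidates; the appeal of an active scheme is that well-chosen queries can shift the weights much faster than random ones.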
