Provably Neural Active Learning Succeeds via Prioritizing Perplexing Samples

Dake Bu; Wei Huang; Taiji Suzuki; Ji Cheng; Qingfu Zhang; Zhiqiang Xu; Hau-San Wong

TL;DR

This work provides a unified theoretical account for why neural active-learning strategies based on uncertainty and diversity succeed. By modeling data as containing easy/strong and hard/weak features plus noise, the authors show that both query criteria effectively prioritize perplexing samples that reveal yet-to-be-learned features, enabling small labeled sets to achieve low test error through benign overfitting. They prove label-complexity reductions and provide explicit bounds under pool-based sampling, validated by experiments on linear and XOR data. The results offer a principled lens to understand NAL and suggest practical extensions, including connections to BADGE and multi-round active learning, with implications for reducing labeling costs in imbalanced datasets.

Abstract

Neural network-based active learning (NAL) is a cost-effective data selection technique that uses neural networks to select and train on a small subset of samples. While existing work has developed various effective or theoretically justified NAL algorithms, the understanding of the two query criteria most commonly used in NAL, uncertainty-based and diversity-based, remains in its infancy. In this work, we move one step forward by offering a unified explanation, from a feature-learning view, for the success of NAL under both query criteria. Specifically, we consider a feature-noise data model comprising easy-to-learn or hard-to-learn features disrupted by noise, and analyze 2-layer NN-based NAL in the pool-based scenario. We prove that uncertainty-based and diversity-based NAL both follow one and the same principle: they prioritize samples that contain yet-to-be-learned features. We further prove that this shared principle is the key to their success, namely achieving a small test error with a small labeled set. In contrast, strategy-free passive learning exhibits a large test error because it learns the yet-to-be-learned features inadequately, and it needs a significantly larger label complexity to reduce the test error sufficiently. Experimental results validate our findings.
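
The shared principle described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: each sample is modeled as one feature patch plus one Gaussian noise patch, the two features are orthogonal, the noise scale `sigma_p` is set to 1, and a toy linear scorer that has absorbed only the strong feature stands in for the 2-layer network before querying. The pool size, dimension, feature norms, and query budget are borrowed from the Figure 2 caption; the helper names `make_pool`, `confidence`, and `uncertainty_query` are hypothetical.

```python
import numpy as np

def make_pool(n, d, mu1_norm, mu2_norm, p, sigma_p, rng):
    """Sample n toy points: one feature patch y*mu_l and one Gaussian noise patch."""
    mu1 = np.zeros(d)
    mu1[0] = mu1_norm              # strong (easy-to-learn) feature
    mu2 = np.zeros(d)
    mu2[1] = mu2_norm              # weak (hard-to-learn) feature, assumed orthogonal to mu1
    y = rng.choice([-1.0, 1.0], size=n)
    is_weak = rng.random(n) < p    # each sample carries the weak feature with probability p
    feature = y[:, None] * np.where(is_weak[:, None], mu2, mu1)
    noise = rng.normal(0.0, sigma_p, size=(n, d))
    return feature, noise, y, is_weak, mu1

def confidence(feature, noise, w):
    """|f(x)| for a toy linear scorer f(x) = <w, feature patch> + <w, noise patch>."""
    return np.abs(feature @ w + noise @ w)

def uncertainty_query(conf, n_query):
    """Indices of the n_query lowest-confidence (most perplexing) pool samples."""
    return np.argsort(conf)[:n_query]

rng = np.random.default_rng(0)
feature, noise, y, is_weak, mu1 = make_pool(
    n=190, d=2000, mu1_norm=9.0, mu2_norm=3.0, p=0.2, sigma_p=1.0, rng=rng
)

# Assumption: before querying, the scorer has learned only the strong feature direction.
w = mu1 / np.linalg.norm(mu1)
queried = uncertainty_query(confidence(feature, noise, w), n_query=30)

print("weak-feature fraction in pool:   ", is_weak.mean())
print("weak-feature fraction in queried:", is_weak[queried].mean())
```

In this toy setting, samples carrying the weak feature receive a score driven only by noise, so the lowest-confidence query is enriched in exactly the samples whose feature has not yet been learned. Uniform (passive) sampling would instead draw them at roughly the pool rate `p`.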

Paper Structure (49 sections, 47 theorems, 133 equations, 8 figures, 1 algorithm)

Introduction
Our Contribution
Related Work
Problem Settings
Data Distribution
Querying Algorithms
Theoretical Results
Proof Sketch
Feature Learning and Noise Memorization Analysis
Order-dependent Sampling (Querying) Analysis
Label Complexity-based Test Error Analysis
Experiments
Potential Extension and Implication for Practical NALs
Conclusion
Additional Related Work: Theory of Feature Learning in Overparameterized Neural Network
...and 34 more sections

Key Result

Proposition 3.2

(Before Querying) At the initial stage before querying, $\forall \varepsilon>0$, under Condition Con4.1, with probability at least $1-\delta$, there exists $t=\widetilde{O}\left(\eta^{-1} \varepsilon^{-1} m n_0 d^{-1} \sigma_p^{-2}\right)$ such that the following hold for all three querying algorithms

Figures (8)

Figure 1: Lions in a real-world dataset.
Figure 2: Learning/memorization progress of features and noise ($\gamma_l$ represents $\max_{j,k} \gamma_{j,k,l}^{(t)}$, and $\rho$ represents $\max_{j,k,i} \rho_{j,k,i}^{(t)}$), train/test losses, and test accuracy of the full-trained model and the three querying algorithms, with $T^*=200$, $d=2000$, $\|\boldsymbol{\mu}_1\|=9$, $p=p^*=0.2$, $\|\boldsymbol{\mu}_2\|=3$, $n_{CNN}=200$, $n_0=10$, $n^{*}=30$ and $\lvert \mathcal{P} \rvert = 190$.
Figure 3: Rescaled $\gamma$ ($\gamma={\mathbb{E}} \gamma_{j,k,l}^{(t)}$), Uncertainty (i.e., $-$Confidence Score) and Feature Distance (for various $p$ of the $l_p$ norm) of the samples in the sampling pool $\mathcal{P}$, where $\gamma$ represents the learning progress of the feature in a particular sample. The dashed line in the graph marks the 30 samples with the highest Feature Distance.
Figure 4: Comparison of querying information between two NAL algorithms, illustrating training size changes in labeled data sets, Confidence Score, and Feature Distance before and after querying.
Figure 5: Learning/memorization progress of features and noise ($\gamma_l$ represents $\max_{j,k} \gamma_{j,k,l}^{(t)}$, and $\rho$ represents $\max_{j,k,i} \rho_{j,k,i}^{(t)}$), train/test losses, and test accuracy of the full-trained model and the three querying algorithms, with $T^*=200$, $d=2000$, $\|\boldsymbol{\mu}_1\|=8$, $p=p^*=0.1$, $\|\boldsymbol{\mu}_2\|=8$, $n_{CNN}=200$, $n_0=10$, $n^{*}=30$ and $\lvert \mathcal{P} \rvert = 190$.
...and 3 more figures

Theorems & Definitions (66)

Definition 2.1
Proposition 3.2
Proposition 3.3
Theorem 3.4
Corollary 3.5
Lemma 4.1
Lemma 4.2
Remark 4.3
Lemma 4.4
Lemma 4.5
...and 56 more
