Provably Neural Active Learning Succeeds via Prioritizing Perplexing Samples

Dake Bu, Wei Huang, Taiji Suzuki, Ji Cheng, Qingfu Zhang, Zhiqiang Xu, Hau-San Wong

TL;DR

This work provides a unified theoretical account for why neural active-learning strategies based on uncertainty and diversity succeed. By modeling data as containing easy/strong and hard/weak features plus noise, the authors show that both query criteria effectively prioritize perplexing samples that reveal yet-to-be-learned features, enabling small labeled sets to achieve low test error through benign overfitting. They prove label-complexity reductions and provide explicit bounds under pool-based sampling, validated by experiments on linear and XOR data. The results offer a principled lens to understand NAL and suggest practical extensions, including connections to BADGE and multi-round active learning, with implications for reducing labeling costs in imbalanced datasets.

Abstract

Neural network-based active learning (NAL) is a cost-effective data selection technique that uses neural networks to select and train on a small subset of samples. While existing work has developed various effective or theoretically justified NAL algorithms, the understanding of the two commonly used query criteria of NAL, uncertainty-based and diversity-based, remains in its infancy. In this work, we take a step forward by offering a unified explanation, from a feature-learning view, for the success of NAL under both query criteria. Specifically, we consider a feature-noise data model comprising easy-to-learn or hard-to-learn features disrupted by noise, and analyze 2-layer NN-based NAL in the pool-based scenario. We prove that both uncertainty-based and diversity-based NAL follow one and the same principle: prioritizing samples that contain yet-to-be-learned features. We further prove that this shared principle is the key to their success, namely achieving a small test error with a small labeled set. In contrast, strategy-free passive learning exhibits a large test error due to inadequate learning of yet-to-be-learned features, and therefore requires a significantly larger label complexity to reduce the test error sufficiently. Experimental results validate our findings.
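To make the feature-noise data model more concrete, the following is a minimal sketch of how such data might be generated. It is an illustration under stated assumptions, not the paper's Definition 2.1: it assumes each sample has one feature patch and one noise patch, that $p$ is the fraction of samples carrying the weak (hard-to-learn) feature, and that noise is orthogonal to both feature directions. The function name `make_feature_noise_data` and its parameters are hypothetical; the default values borrow the norms and $p$ reported in the Figure 2 caption.

```python
import numpy as np

def make_feature_noise_data(n, d=2000, mu1_norm=9.0, mu2_norm=3.0,
                            p=0.2, sigma_p=1.0, seed=0):
    """Sample n points from an assumed two-patch feature-noise model.

    Returns the feature patches, the noise patches, the labels in {-1, +1},
    and a boolean mask marking samples that carry the weak feature.
    """
    rng = np.random.default_rng(seed)

    # Assumption: strong and weak feature directions are orthogonal axes.
    mu1 = np.zeros(d)
    mu1[0] = mu1_norm  # strong, easy-to-learn feature
    mu2 = np.zeros(d)
    mu2[1] = mu2_norm  # weak, hard-to-learn feature

    y = rng.choice([-1, 1], size=n)
    is_weak = rng.random(n) < p  # roughly a fraction p of weak samples

    # Feature patch: label times the selected feature vector.
    feats = np.where(is_weak[:, None], mu2, mu1) * y[:, None]

    # Noise patch: Gaussian, zeroed on the two feature coordinates so that
    # it stays orthogonal to both feature directions (an assumption here).
    noise = rng.normal(0.0, sigma_p, size=(n, d))
    noise[:, :2] = 0.0

    return feats, noise, y, is_weak
```

Under this sketch, a small random initial labeled set contains few weak-feature samples when $p$ is small, which is the regime where the abstract argues that passive learning leaves features unlearned.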

Paper Structure (49 sections, 47 theorems, 133 equations, 8 figures, 1 algorithm)

This paper contains 49 sections, 47 theorems, 133 equations, 8 figures, 1 algorithm.

Key Result

Proposition 3.2

(Before Querying) At the initial stage before querying, $\forall \varepsilon>0$, under Condition Con4.1, with probability at least $1-\delta$, there exists $t=\widetilde{O}\left(\eta^{-1} \varepsilon^{-1} m n_0 d^{-1} \sigma_p^{-2}\right)$ such that the following hold for all three querying algorithms

Figures (8)

  • Figure 1: Lions in real-world dataset.
  • Figure 2: Learning/memorization progress of features and noise ($\gamma_l$ represents $\max_{j,k} \gamma_{j,k,l}^{(t)}$, and $\rho$ represents $\max_{j,k,i} \rho_{j,k,i}^{(t)}$), train/test losses, and test accuracy of the full-trained model and the three querying algorithms, with $T^*=200$, $d=2000$, $\|\boldsymbol{\mu}_1\|=9$, $p=p^*=0.2$, $\|\boldsymbol{\mu}_2\|=3$, $n_{CNN}=200$, $n_0=10$, $n^{*}=30$ and $\lvert \mathcal{P} \rvert = 190$.
  • Figure 3: Rescaled $\gamma$ ($\gamma={\mathbb{E}} \gamma_{j,k,l}^{(t)}$), Uncertainty (i.e., $-$Confidence Score) and Feature Distance (for various $p$ of the $l_p$ norm) of the samples in the sampling pool $\mathcal{P}$, where $\gamma$ represents the learning progress of the feature in a particular sample. The dashed line in the graph marks the top 30 samples with the highest Feature Distance.
  • Figure 4: Comparison of querying information between two NAL algorithms, illustrating training size changes in labeled data sets, Confidence Score, and Feature Distance before and after querying.
  • Figure 5: Learning/memorization progress of features and noise ($\gamma_l$ represents $\max_{j,k} \gamma_{j,k,l}^{(t)}$, and $\rho$ represents $\max_{j,k,i} \rho_{j,k,i}^{(t)}$), train/test losses, and test accuracy of the full-trained model and the three querying algorithms, with $T^*=200$, $d=2000$, $\|\boldsymbol{\mu}_1\|=8$, $p=p^*=0.1$, $\|\boldsymbol{\mu}_2\|=8$, $n_{CNN}=200$, $n_0=10$, $n^{*}=30$ and $\lvert \mathcal{P} \rvert = 190$.
  • ...and 3 more figures
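
The two query scores named in the Figure 3 caption, Uncertainty (the negative Confidence Score) and Feature Distance under an $l_p$ norm, can be sketched as follows. This summary does not reproduce the paper's exact definitions, so the code rests on assumptions: confidence is taken to be the absolute value of a scalar network output for binary labels, and Feature Distance is taken to be the distance from a pool sample's feature representation to its nearest labeled sample. All function names are hypothetical.

```python
import numpy as np

def uncertainty_scores(outputs):
    """Uncertainty as negative confidence.

    Assumption: `outputs` holds one scalar network output per pool sample,
    and confidence is its absolute value (binary +1/-1 classification).
    """
    return -np.abs(np.asarray(outputs, dtype=float))

def feature_distance_scores(pool_feats, labeled_feats, p=2):
    """Distance from each pool sample to its nearest labeled sample.

    Assumption: both inputs are (num_samples, feature_dim) arrays of learned
    representations, and distance is measured with the l_p vector norm.
    """
    diffs = pool_feats[:, None, :] - labeled_feats[None, :, :]
    dists = np.linalg.norm(diffs, ord=p, axis=2)  # shape (n_pool, n_labeled)
    return dists.min(axis=1)

def select_top_k(scores, k):
    """Indices of the k highest-scoring pool samples (ties keep pool order)."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    return order[:k]
```

With either score, a query round would label the `k` pool samples returned by `select_top_k`; the figure captions use a query size of $n^{*}=30$ from a pool of 190 samples.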

Theorems & Definitions (66)

  • Definition 2.1
  • Proposition 3.2
  • Proposition 3.3
  • Theorem 3.4
  • Corollary 3.5
  • Lemma 4.1
  • Lemma 4.2
  • Remark 4.3
  • Lemma 4.4
  • Lemma 4.5
  • ...and 56 more