Table of Contents
Fetching ...

Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds

Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, Alekh Agarwal

TL;DR

The paper tackles label-efficient batch active learning for deep neural networks by proposing BADGE, a gradient-embedding based sampler. BADGE computes last-layer gradient embeddings using the model's current predictions and uses k-means++ seeding to select diverse, informative batches. Through extensive experiments across architectures, batch sizes, and datasets, BADGE consistently matches or outperforms baselines and shows robustness to hyperparameters. The method provides a practical, hyperparameter-free approach that blends uncertainty and diversity, with favorable scaling and runtime characteristics compared with alternatives like k-DPP.

Abstract

We design a new algorithm for batch active learning with deep neural network models. Our algorithm, Batch Active learning by Diverse Gradient Embeddings (BADGE), samples groups of points that are disparate and high-magnitude when represented in a hallucinated gradient space, a strategy designed to incorporate both predictive uncertainty and sample diversity into every selected batch. Crucially, BADGE trades off between diversity and uncertainty without requiring any hand-tuned hyperparameters. We show that while other approaches sometimes succeed for particular batch sizes or architectures, BADGE consistently performs as well or better, making it a versatile option for practical active learning problems.

Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds

TL;DR

The paper tackles label-efficient batch active learning for deep neural networks by proposing BADGE, a gradient-embedding based sampler. BADGE computes last-layer gradient embeddings using the model's current predictions and uses k-means++ seeding to select diverse, informative batches. Through extensive experiments across architectures, batch sizes, and datasets, BADGE consistently matches or outperforms baselines and shows robustness to hyperparameters. The method provides a practical, hyperparameter-free approach that blends uncertainty and diversity, with favorable scaling and runtime characteristics compared with alternatives like k-DPP.

Abstract

We design a new algorithm for batch active learning with deep neural network models. Our algorithm, Batch Active learning by Diverse Gradient Embeddings (BADGE), samples groups of points that are disparate and high-magnitude when represented in a hallucinated gradient space, a strategy designed to incorporate both predictive uncertainty and sample diversity into every selected batch. Crucially, BADGE trades off between diversity and uncertainty without requiring any hand-tuned hyperparameters. We show that while other approaches sometimes succeed for particular batch sizes or architectures, BADGE consistently performs as well or better, making it a versatile option for practical active learning problems.

Paper Structure

This paper contains 19 sections, 1 theorem, 7 equations, 31 figures, 2 algorithms.

Key Result

Proposition 1

For all $y \in \left\{1,\ldots,K\right\}$, let $g_x^y = \frac{\partial}{\partial W} \ell_{\mathrm{CE}}(f(x;\theta), y)$. Then Consequently, $\hat{y} = \mathop{\mathrm{argmin}}_{y \in [K]} \| g_x^y \|$.

Figures (31)

  • Figure 1: Left and center: Learning curves for $\operatorname{\textsc{$k$-means++}}$ and $k$-DPP sampling with gradient embeddings for different scenarios. The performance of the two sampling approaches nearly perfectly overlaps. Right: A run time comparison (seconds) corresponding to the middle scenario. Each line is the average over five independent experiments. Standard errors are shown by shaded regions.
  • Figure 2: A comparison of batch selection algorithms using our gradient embedding. Left and center: Plots showing the log determinant of the Gram matrix of the selected batch of gradient embeddings as learning progresses. Right: The average embedding magnitude (a measurement of predictive uncertainty) in the selected batch. The $\operatorname{\textsc{FF-$k$-center}}$ sampler finds points that are not as diverse or high-magnitude as other samplers. Notice also that $\operatorname{\textsc{$k$-means++}}$ tends to actually select samples that are both more diverse and higher-magnitude than a $k$-DPP, a potential pathology of the $k$-DPP's degree of stochastisity. Standard errors are shown by shaded regions.
  • Figure 3: Active learning test accuracy versus the number of total labeled samples for a range of conditions. Standard errors are shown by shaded regions.
  • Figure 4: A pairwise penalty matrix over all experiments. Element $P_{i,j}$ corresponds roughly to the number of times algorithm $i$ outperforms algorithm $j$. Column-wise averages at the bottom show overall performance (lower is better).
  • Figure 5: The cumulative distribution function of normalized errors for all acquisition functions.
  • ...and 26 more figures

Theorems & Definitions (1)

  • Proposition 1