Active Few-Shot Fine-Tuning

Jonas Hübotter; Bhavya Sukhija; Lenart Treven; Yarden As; Andreas Krause

TL;DR

This work proposes ITL (information-based transductive learning), an approach that samples adaptively to maximize the information gained about a specified task, and is the first to show that such decision rules converge uniformly to the smallest possible uncertainty obtainable from the accessible data.

Abstract

We study the question: How can we select the right data for fine-tuning to a specific task? We call this data selection problem active fine-tuning and show that it is an instance of transductive active learning, a novel generalization of classical active learning. We propose ITL, short for information-based transductive learning, an approach which samples adaptively to maximize information gained about the specified task. We are the first to show, under general regularity assumptions, that such decision rules converge uniformly to the smallest possible uncertainty obtainable from the accessible data. We apply ITL to the few-shot fine-tuning of large neural networks and show that fine-tuning with ITL learns the task with significantly fewer examples than the state-of-the-art.
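The decision rule described in the abstract can be illustrated with a minimal sketch in the Gaussian process setting. It assumes the rule picks, in each round, the point $x$ in the sample space $\mathcal{S}$ that maximizes the mutual information $I(f_{\mathcal{A}}; y_x \mid \mathcal{D})$, and it uses the standard Gaussian identity $I = \tfrac{1}{2} \log\big(\mathrm{Var}[y_x \mid \mathcal{D}] \,/\, \mathrm{Var}[y_x \mid f_{\mathcal{A}}, \mathcal{D}]\big)$. The RBF kernel, the 1-D toy data, and the names `rbf_kernel`, `condition`, and `itl_select` are hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Squared-exponential kernel matrix between rows of X and rows of Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def condition(K, idx, noise_var):
    """Covariance over all points after noisy observations at indices idx."""
    if len(idx) == 0:
        return K
    K_oo = K[np.ix_(idx, idx)] + np.diag(noise_var[idx])
    K_ao = K[:, idx]
    return K - K_ao @ np.linalg.solve(K_oo, K_ao.T)

def itl_select(K, sample_idx, target_idx, noise_var, n_rounds, jitter=1e-9):
    """Greedily pick points in S that are most informative about f_A.

    Assumption: the score is I(f_A; y_x | D) under a GP prior with
    covariance K and independent Gaussian noise of variance noise_var.
    """
    observed = []
    for _ in range(n_rounds):
        post = condition(K, observed, noise_var)
        # Variance of f_x given the data D only.
        var_x = np.diag(post)[sample_idx]
        # Variance of f_x given D and (noise-free) knowledge of f_A.
        P_aa = post[np.ix_(target_idx, target_idx)]
        P_aa = P_aa + jitter * np.eye(len(target_idx))
        P_xa = post[np.ix_(sample_idx, target_idx)]
        reduction = np.einsum("ij,ji->i", P_xa, np.linalg.solve(P_aa, P_xa.T))
        var_x_given_A = np.maximum(var_x - reduction, 0.0)
        # Mutual information between f_A and the noisy observation y_x.
        rho2 = noise_var[sample_idx]
        scores = 0.5 * np.log((var_x + rho2) / (var_x_given_A + rho2))
        observed.append(int(sample_idx[int(np.argmax(scores))]))
    return observed

# Toy problem (hypothetical): S is a grid on [0, 1], A lies outside S.
S_points = np.linspace(0.0, 1.0, 15)
A_points = np.array([1.1, 1.25, 1.4])
X = np.concatenate([S_points, A_points])[:, None]
K = rbf_kernel(X, X, lengthscale=0.3)
sample_idx = np.arange(len(S_points))
target_idx = np.arange(len(S_points), len(X))
noise_var = np.full(len(X), 0.01)

chosen = itl_select(K, sample_idx, target_idx, noise_var, n_rounds=5)
prior_var = np.diag(K)[target_idx].sum()
post_var = np.diag(condition(K, chosen, noise_var))[target_idx].sum()
```

In this toy setup, `chosen` holds the indices of the queried points in $\mathcal{S}$, and comparing `post_var` with `prior_var` shows how much the total marginal variance on the target space shrinks. Points may be selected more than once, since repeated noisy observations still carry information.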

Paper Structure (55 sections, 14 theorems, 66 equations, 11 figures, 3 tables, 1 algorithm)

Introduction
Transductive active learning
Contributions
Preliminaries
Background on information theory
Main Results on Transductive Active Learning
Gaussian Process Setting
Convergence to irreducible uncertainty
Agnostic Setting
Few-Shot Fine-Tuning of Neural Networks
How can we leverage the latent structure learned by the pre-trained model?
Batch selection: Diversity via conditional embeddings
Experiments
Testbeds & architectures
Results
...and 40 more sections

Key Result

Theorem 3.1

Assume that $f \sim \mathcal{GP}(\mu, k)$ with known mean function $\mu$ and kernel $k$, that the noise $\varepsilon_{\boldsymbol{x}}$ is mutually independent and zero-mean Gaussian with known variance $\rho^2(\boldsymbol{x}) > 0$, and that $\gamma_n$ is sublinear in $n$. Then, for any $n \geq 1$, $\epsilon$ ...

Figures (11)

Figure 1: Instances of transductive active learning where the target space $\mathcal{A}$ is shown in blue and the sample space $\mathcal{S}$ is shown in gray. The points denote plausible observations within $\mathcal{S}$ to "learn" $\mathcal{A}$. In (A), the target space contains "everything" within $\mathcal{S}$ as well as points outside $\mathcal{S}$. In (B, C, D), one makes observations directed towards learning about a particular target. Prior work on active learning has focused on the instances $\mathcal{A} = \mathcal{S}$ and $\mathcal{A} \subset \mathcal{S}$.
Figure 2: Few-shot training of NNs on MNIST (left) and CIFAR-100 (right). Random selects each observation uniformly at random from $\mathcal{P}_{\mathcal{S}}$. The batch size is $1$ for MNIST and $10$ for CIFAR-100. Uncertainty bands correspond to one standard error over $10$ random seeds. We see that ITL significantly outperforms the state-of-the-art, and in particular, retrieves substantially more samples from the support of $\mathcal{P}_{\mathcal{A}}$ than competing methods. This trend becomes even more pronounced in more difficult large-scale learning tasks (cf. \ref{fig:nns_imbalanced_train} in \ref{sec:nns_appendix}). See \ref{sec:nns_appendix} for details and additional experiments.
Figure 3: Batch selection via conditional embeddings improves substantially over selecting the top-$b$ candidates proposed by the decision rule. This is the CIFAR-100 experiment (where $b=10$).
Figure 4: Comparison of loss gradient ("G-") and last-layer embeddings ("L-").
Figure 5: Uncertainty quantification (i.e., estimation of $\boldsymbol{\Sigma}$) via a Laplace approximation (LA, \citep{daxberger2021laplace}) over last-layer weights using a Kronecker-factored log-likelihood Hessian approximation \citep{martens2015optimizing} and the loss gradient embeddings from \ref{eq:loss_gradient_embedding}. The results are shown for the MNIST experiment. We do not observe a performance improvement beyond the trivial approximation $\boldsymbol{\Sigma} = \boldsymbol{I}$.
...and 6 more figures

Theorems & Definitions (28)

Theorem 3.1: Generalization bound on marginal variance for ITL
Theorem 3.2: Bound on generalization error for ITL, following \citet{abbasi2013online, chowdhury2017kernelized}
Definition C.1: Submodularity ratio of ITL
Theorem C.2: Efficiency of batch selection via conditional embeddings
Theorem D.1: Bound of uncertainty reduction for ITL
proof
Lemma D.2: Uniform bound of marginal variance within $\mathcal{S}$
proof
Definition D.3: Approximate Markov boundary
Lemma D.4: Existence of an approximate Markov boundary
...and 18 more
