Active Few-Shot Fine-Tuning

Jonas Hübotter, Bhavya Sukhija, Lenart Treven, Yarden As, Andreas Krause

TL;DR

This work proposes ITL, short for information-based transductive learning, an approach which samples adaptively to maximize information gained about the specified task, and is the first to show that such decision rules converge uniformly to the smallest possible uncertainty obtainable from the accessible data.

Abstract

We study the question: How can we select the right data for fine-tuning to a specific task? We call this data selection problem active fine-tuning and show that it is an instance of transductive active learning, a novel generalization of classical active learning. We propose ITL, short for information-based transductive learning, an approach which samples adaptively to maximize information gained about the specified task. We are the first to show, under general regularity assumptions, that such decision rules converge uniformly to the smallest possible uncertainty obtainable from the accessible data. We apply ITL to the few-shot fine-tuning of large neural networks and show that fine-tuning with ITL learns the task with significantly fewer examples than the state-of-the-art.
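The selection rule described in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a finite domain, a Gaussian process prior with a known kernel matrix, and a constant noise variance, and it uses the standard identity for jointly Gaussian variables, $I(f_{\mathcal{A}}; y_{\boldsymbol{x}} \mid \mathcal{D}) = \frac{1}{2} \log \frac{\mathrm{Var}[y_{\boldsymbol{x}} \mid \mathcal{D}]}{\mathrm{Var}[y_{\boldsymbol{x}} \mid f_{\mathcal{A}}, \mathcal{D}]}$, to score each candidate in the sample space $\mathcal{S}$ by the information it carries about the target space $\mathcal{A}$. The names `rbf_kernel`, `itl_scores`, `condition_on`, and `select_itl` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    """Squared-exponential kernel matrix over the rows of X (assumed prior)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def itl_scores(Sigma, sample_idx, target_idx, noise_var, jitter=1e-8):
    """Mutual information I(f_A; y_x | D) for every candidate x in S.

    Sigma is the current posterior covariance over the whole finite domain.
    """
    S_AA = Sigma[np.ix_(target_idx, target_idx)] + jitter * np.eye(len(target_idx))
    S_xA = Sigma[np.ix_(sample_idx, target_idx)]
    var_x = Sigma[sample_idx, sample_idx]  # posterior variance of f(x)
    # Part of Var[f(x)] that is explained by knowing f on the target space.
    explained = np.einsum("ij,ji->i", S_xA, np.linalg.solve(S_AA, S_xA.T))
    cond_var = np.maximum(var_x - explained, 0.0)
    return 0.5 * np.log((var_x + noise_var) / (cond_var + noise_var))

def condition_on(Sigma, i, noise_var):
    """Posterior covariance after one noisy observation at domain index i."""
    col = Sigma[:, i]
    return Sigma - np.outer(col, col) / (Sigma[i, i] + noise_var)

def select_itl(K, sample_idx, target_idx, noise_var, n_steps):
    """Greedily pick n_steps points from S, each maximizing information about A."""
    Sigma = K.copy()
    chosen = []
    for _ in range(n_steps):
        scores = itl_scores(Sigma, sample_idx, target_idx, noise_var)
        i = int(sample_idx[int(np.argmax(scores))])
        chosen.append(i)
        Sigma = condition_on(Sigma, i, noise_var)
    return chosen, Sigma

# Toy setup: S is the left half of [0, 1]; A contains two points outside S.
X = np.linspace(0.0, 1.0, 21)[:, None]
K = rbf_kernel(X, lengthscale=0.2)
sample_idx = np.arange(0, 11)
target_idx = np.array([14, 15])
chosen, Sigma = select_itl(K, sample_idx, target_idx, noise_var=0.01, n_steps=3)
print("selected indices:", chosen)
print("target variance before:", K[target_idx, target_idx])
print("target variance after: ", Sigma[target_idx, target_idx])
```

In this toy setting the target points lie outside the sample space, so the rule can only reduce uncertainty about them indirectly, through observations in $\mathcal{S}$ that are correlated with $\mathcal{A}$ under the assumed kernel.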

Paper Structure

This paper contains 55 sections, 14 theorems, 66 equations, 11 figures, 3 tables, 1 algorithm.

Key Result

Theorem 3.1

Assume that ${f \sim \mathcal{GP}(\mu, k)}$ with known mean function $\mu$ and kernel $k$, the noise $\varepsilon_{\boldsymbol{x}}$ is mutually independent and zero-mean Gaussian with known variance $\rho^2(\boldsymbol{x}) > 0$, and $\gamma_n$ is sublinear in $n$. Then, for any $n \geq 1, \epsilo

Figures (11)

  • Figure 1: Instances of transductive active learning where the target space $\mathcal{A}$ is shown in blue and the sample space $\mathcal{S}$ is shown in gray. The points denote plausible observations within $\mathcal{S}$ to "learn" $\mathcal{A}$. In (A), the target space contains "everything" within $\mathcal{S}$ as well as points outside $\mathcal{S}$. In (B, C, D), one makes observations directed towards learning about a particular target. Prior work on active learning has focused on the instances $\mathcal{A} = \mathcal{S}$ and $\mathcal{A} \subset \mathcal{S}$.
  • Figure 2: Few-shot training of NNs on MNIST (left) and CIFAR-100 (right). Random selects each observation uniformly at random from $\mathcal{P}_{\!\mathcal{S}}$. The batch size is $1$ for MNIST and $10$ for CIFAR-100. Uncertainty bands correspond to one standard error over $10$ random seeds. We see that ITL significantly outperforms the state-of-the-art, and in particular, retrieves substantially more samples from the support of $\mathcal{P}_{\!\!\mathcal{A}}$ than competing methods. This trend becomes even more pronounced in more difficult large-scale learning tasks (cf. \ref{fig:nns_imbalanced_train} in \ref{sec:nns_appendix}). See \ref{sec:nns_appendix} for details and additional experiments.
  • Figure 3: Batch selection via conditional embeddings improves substantially over selecting the top-$b$ candidates proposed by the decision rule. This is the CIFAR-100 experiment (where $b=10$).
  • Figure 4: Comparison of loss gradient ("G-") and last-layer embeddings ("L-").
  • Figure 5: Uncertainty quantification (i.e., estimation of $\boldsymbol{\Sigma}$) via a Laplace approximation (LA, \citep{daxberger2021laplace}) over last-layer weights using a Kronecker factored log-likelihood Hessian approximation \citep{martens2015optimizing} and the loss gradient embeddings from \ref{eq:loss_gradient_embedding}. The results are shown for the MNIST experiment. We do not observe a performance improvement beyond the trivial approximation $\boldsymbol{\Sigma} = \boldsymbol{I}$.
  • ...and 6 more figures

Theorems & Definitions (28)

  • Theorem 3.1: Generalization bound on marginal variance for ITL
  • Theorem 3.2: Bound on generalization error for ITL, following \citep{abbasi2013online, chowdhury2017kernelized}
  • Definition C.1: Submodularity ratio of ITL
  • Theorem C.2: Efficiency of batch selection via conditional embeddings
  • Theorem D.1: Bound of uncertainty reduction for ITL
  • proof
  • Lemma D.2: Uniform bound of marginal variance within $\mathcal{S}$
  • proof
  • Definition D.3: Approximate Markov boundary
  • Lemma D.4: Existence of an approximate Markov boundary
  • ...and 18 more