Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs

Jonas Hübotter, Sascha Bongni, Ido Hakimi, Andreas Krause

TL;DR

This paper addresses the inefficiency of data selection for test-time fine-tuning (TTFT) of large language models: Nearest Neighbor (NN) retrieval tends to select duplicated information. It introduces SIFT, a transductive active-learning algorithm that selects data to minimize uncertainty about the model's response to the prompt, which amounts to maximizing information gain within a tractable surrogate model. The authors prove that SIFT reduces this uncertainty, give a convergence guarantee toward an irreducible uncertainty, and describe compute-efficient implementations together with an adaptive TTFT framework that spends compute in proportion to the realized gains. Empirically, SIFT consistently outperforms NN-based data selection and uncertainty sampling on the Pile benchmark across multiple base models, achieves state-of-the-art TTFT performance on several tasks, and yields uncertainty estimates that can guide adaptive compute. The work offers a practical drop-in replacement for NN retrieval (the activeft library) and suggests scaling laws and future directions for TTFT across domains and model families.

Abstract

Recent efforts in fine-tuning language models often rely on automatic data selection, commonly using Nearest Neighbors retrieval from large datasets. However, we theoretically show that this approach tends to select redundant data, limiting its effectiveness or even hurting performance. To address this, we introduce SIFT, a data selection algorithm designed to reduce uncertainty about the model's response given a prompt, which unifies ideas from retrieval and active learning. Whereas Nearest Neighbor retrieval typically fails in the presence of information duplication, SIFT accounts for information duplication and optimizes the overall information gain of the selected examples. We focus our evaluations on fine-tuning at test-time for prompt-specific language modeling on the Pile dataset, and show that SIFT consistently outperforms Nearest Neighbor retrieval, with minimal computational overhead. Moreover, we show that our uncertainty estimates can predict the performance gain of test-time fine-tuning, and use this to develop an adaptive algorithm that invests test-time compute proportional to realized performance gains. We provide the $\texttt{activeft}$ (Active Fine-Tuning) library which can be used as a drop-in replacement for Nearest Neighbor retrieval.
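
The contrast between Nearest Neighbor retrieval and uncertainty-minimizing selection can be illustrated with a minimal sketch. It assumes a surrogate with a linear kernel over unit-normalized embeddings and a regularization parameter `lam`; the function names, the toy embeddings, and the value of `lam` are hypothetical choices made for illustration and are not the interface of the activeft library.

```python
import numpy as np

def nearest_neighbors(prompt, data, n):
    """Indices of the n rows of `data` most similar to `prompt` (dot product)."""
    sims = data @ prompt
    return [int(i) for i in np.argsort(-sims, kind="stable")[:n]]

def posterior_variance(prompt, data, selected, lam):
    """Surrogate uncertainty at `prompt` after observing the rows in `selected`.

    Assumed form: k(p, p) - k_X(p)^T (K_X + lam * I)^{-1} k_X(p) with a linear kernel.
    """
    prior = prompt @ prompt
    if not selected:
        return prior
    X = data[selected]                               # repeats are allowed
    K = X @ X.T + lam * np.eye(len(selected))        # regularized kernel matrix
    k = X @ prompt                                   # similarity of each pick to the prompt
    return prior - k @ np.linalg.solve(K, k)

def greedy_uncertainty_selection(prompt, data, n, lam=0.01):
    """Greedily add the point that leaves the least uncertainty about the prompt."""
    selected = []
    for _ in range(n):
        scores = [posterior_variance(prompt, data, selected + [i], lam)
                  for i in range(len(data))]
        selected.append(int(np.argmin(scores)))
    return selected

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Toy embeddings: rows 0 and 1 are exact duplicates covering one aspect of the
# prompt; row 2 covers the other aspect but is slightly less similar overall.
prompt = unit([1.0, 1.0])
data = np.stack([unit([1.0, 0.2]), unit([1.0, 0.2]), unit([0.1, 1.0])])

nn_choice = nearest_neighbors(prompt, data, 2)
sift_choice = greedy_uncertainty_selection(prompt, data, 2, lam=0.01)
print("nearest neighbor:", nn_choice)
print("uncertainty-minimizing:", sift_choice)
```

In this toy setup, Nearest Neighbor returns the two duplicates, while the greedy rule pairs one of them with the complementary third point, because a second copy barely reduces the remaining uncertainty. With a very large `lam` the same greedy rule falls back to re-selecting the nearest neighbor, matching the behavior the Figure 5 caption describes for $\lambda' \to \infty$.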

Paper Structure

This paper contains 96 sections, 9 theorems, 53 equations, 26 figures, 13 tables, 4 algorithms.

Key Result

Theorem 3.2

Let the linearity assumption hold and $\boldsymbol{W}^\star \in \mathcal{W}_B$. Let $\delta \in (0,1)$ and set [...] where $L \doteq \sup_{\boldsymbol{x} \in \mathcal{X}, \boldsymbol{W} \in \mathcal{W}_B} \lambda_{\max}(\boldsymbol{A}(\boldsymbol{x}; \boldsymbol{W}))$. Then [...] where $d_{\mathrm{TV}}\left( \boldsymbol{s},\boldsymbol{s'} \right) \doteq \dots$

Figures (26)

  • Figure 1: Selecting fine-tuning data using SIFT (red) robustly outperforms Nearest Neighbor retrieval (black) and avoids the failure-mode of Nearest Neighbor retrieval where the same data is selected repeatedly, which is a common result of information duplication.
  • Figure 2: We consider a scenario where we have a pre-trained language model capturing a latent manifold (red) in the large sequence space (white). We aim to improve the model's performance on a given prompt (blue) by efficiently fine-tuning the model on a few relevant and diverse data points (black) at test-time.
  • Figure 3: We retrieve two data points to answer the prompt. Nearest Neighbor selects redundant data, while SIFT yields maximal information (cf. the examples section).
  • Figure 4: The (multiplicative) computational overhead of SIFT compared to Nearest Neighbor retrieval is minimal. The compute overhead with a 1k data space is less than $1.05\times$.
  • Figure 5: Bits per byte (in % relative to the base model, $\downarrow$ better) after $50$ test-time iterations. Left: Performance gains of SIFT are consistent across models. The failure-mode of Nearest Neighbor consistently performs worse than the base model. The GPT-2-large and Phi-3 tables in the full-results section detail our results with GPT-2-large and Phi-3 analogously to the main per-dataset results table. Right: Most choices of $\lambda'$ lead to comparable performance. With $\lambda' \to \infty$, SIFT($\lambda'$) repeatedly selects the nearest neighbor.
  • ...and 21 more figures

Theorems & Definitions (17)

  • Theorem 3.2: Confidence Sets
  • Proposition 3.3
  • Theorem C.2: Convergence Guarantee, formalization of the informal convergence theorem
  • proof
  • Proposition K.1: Insufficiency of Nearest Neighbor Retrieval
  • proof
  • proof
  • Theorem K.4: Confidence Sets for Regression
  • proof
  • Lemma K.5: Corollary 1 of amani2021ucb
  • ...and 7 more