An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models

Gantavya Bhatt, Yifang Chen, Arnav M. Das, Jifan Zhang, Sang T. Truong, Stephen Mussmann, Yinglun Zhu, Jeffrey Bilmes, Simon S. Du, Kevin Jamieson, Jordan T. Ash, Robert D. Nowak

TL;DR

The paper tackles the rising annotation burden in supervised finetuning of large language models by introducing an experimental-design framework that selects an informative, fixed set of prompts to annotate in one shot, avoiding the iterative retraining costs of active learning. It develops and evaluates uncertainty-based, k-center, and submodular selection strategies, including novel scores like maximum token uncertainty and a facility-location-based approach, demonstrating substantial label-efficiency gains. On a 7B-scale LLaMA-2 finetuned with LoRA over a 99K FLAN V2 prompt pool, the methods achieve roughly 50% annotation cost savings while maintaining or improving generalization on MMLU and BBH benchmarks, with GPT-4-based evaluations corroborating the improvements. This approach promises scalable, cost-effective instruction tuning for domain-specific or compute-constrained settings, while acknowledging biases from subset selection and other limitations such as model scale and reliance on curated data.

Abstract

Supervised finetuning (SFT) on instruction datasets has played a crucial role in achieving the remarkable zero-shot generalization capabilities observed in modern large language models (LLMs). However, the annotation effort required to produce high-quality responses for instructions is becoming prohibitively expensive, especially as the number of tasks spanned by instruction datasets continues to increase. Active learning is effective at identifying useful subsets of samples to annotate from an unlabeled pool, but its high computational cost remains a barrier to its widespread applicability in the context of LLMs. To mitigate the annotation cost of SFT and circumvent the computational bottlenecks of active learning, we propose using experimental design. Experimental design techniques select the most informative samples to label, and typically maximize some notion of uncertainty and/or diversity. In our work, we implement a framework that evaluates several existing and novel experimental design techniques and find that these methods consistently yield significant gains in label efficiency with little computational overhead. On generative tasks, our methods achieve the same generalization performance with only $50\%$ of the annotation cost required by random sampling.
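
To make the diversity side of one-shot selection concrete, here is a minimal sketch of the standard greedy k-center heuristic, one of the strategy families named in the TL;DR. It is an illustration under assumptions, not the authors' implementation: the feature matrix `X` (one embedding row per unlabeled prompt), the Euclidean metric, and the function name `k_center_greedy` are all hypothetical choices that this section does not specify.

```python
import numpy as np


def k_center_greedy(X, budget, first=0):
    """Pick `budget` row indices of X with the greedy k-center heuristic.

    X      : (n, d) array of prompt embeddings (assumed, not from the paper).
    budget : number of prompts to send for annotation.
    first  : index of the arbitrary starting center.
    """
    selected = [first]
    # Distance from every pool point to its nearest selected center so far.
    min_dist = np.linalg.norm(X - X[first], axis=1)
    while len(selected) < budget:
        # The point farthest from all current centers is the least covered.
        j = int(np.argmax(min_dist))
        selected.append(j)
        # Adding j can only shrink each point's nearest-center distance.
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[j], axis=1))
    return selected


if __name__ == "__main__":
    # Three tight, well-separated toy clusters of two points each.
    X = np.array([[0.0, 0.0], [0.1, 0.0],
                  [10.0, 0.0], [10.1, 0.0],
                  [0.0, 10.0], [0.0, 10.1]])
    print(k_center_greedy(X, budget=3))
```

The selection is computed once from the unlabeled pool, with no model retraining between picks, which is the property that separates experimental design from the active-learning loop described above.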

Paper Structure (24 sections, 8 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Comparison between different annotation schemes for label-efficient SFT. Random sampling simply chooses prompts uniformly at random, which underperforms because it is prone to redundancy and may oversample from the major modes. Alternatively, prompts can be chosen more strategically through either active learning or experimental design. Active learning, however, is an adaptive procedure that requires computationally expensive model retraining and inference for every batch of annotations. In this paper, we study the problem through the lens of experimental design, which is more label-efficient than random sampling while incurring minimal computational cost compared to active learning.
  • Figure 2: GPT-4 Turbo evaluation comparing models trained on 45K prompts selected by various strategies against a model trained on 90K random prompts. We report the win rate weighted by the continuous preferences of the GPT-4 Turbo model. Error bars show the standard errors across prompts.
  • Figure 3: Gains versus set size over the course of greedy maximization for different kernel widths $\gamma$; we run the greedy procedure until the budget of 45K is reached. For higher $\gamma$, gains fall to a very small value (and continue to decrease linearly) even before 1K (for $\gamma = 10$) and 10K (for $\gamma = 1$) elements are selected. Reducing $\gamma$ helps, but gains still decrease sublinearly ($\gamma = 0.1$, after 20K). Notably, gains remain relatively stable for $\gamma \in \{10^{-3}, 5\times 10^{-3}, 10^{-2}\}$ until the desired budget of 45K is reached, suggesting a suitable range for $\gamma$.
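
The quantity tracked in Figure 3 can be reproduced on toy data with a short sketch of greedy facility-location maximization that records the marginal gain of each pick. This is a hedged illustration only: the kernel form $\exp(-\lVert x_i - x_j \rVert^2 / \gamma)$, the random features, and the helper names `rbf_similarity` and `greedy_facility_location` are assumptions, since this section does not state the exact kernel or embeddings used in the paper.

```python
import numpy as np


def rbf_similarity(X, gamma):
    """Pairwise similarities exp(-||x_i - x_j||^2 / gamma).

    Treating gamma as a width (larger gamma = wider kernel) is an assumption.
    """
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-d2 / gamma)


def greedy_facility_location(S, budget):
    """Greedily maximize f(A) = sum_i max_{j in A} S[i, j].

    Returns the selected indices and the marginal gain of each pick.
    """
    n = S.shape[0]
    best = np.zeros(n)  # best[i] = max similarity of point i to the chosen set
    selected, gains = [], []
    for _ in range(budget):
        # Gain of candidate j: total improvement in coverage over all points i.
        marginal = np.maximum(S - best[:, None], 0.0).sum(axis=0)
        if selected:
            marginal[selected] = -np.inf  # never pick the same element twice
        j = int(np.argmax(marginal))
        selected.append(j)
        gains.append(float(marginal[j]))
        best = np.maximum(best, S[:, j])
    return selected, gains


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))  # toy stand-in for prompt embeddings
    for gamma in (0.1, 1.0, 10.0, 100.0):
        _, gains = greedy_facility_location(rbf_similarity(X, gamma), budget=20)
        print(gamma, [round(g, 3) for g in gains[:5]])
```

Because facility location is monotone submodular, the recorded greedy gains never increase from one pick to the next; plotting `gains` against the pick index for several values of `gamma` gives a small-scale analogue of the curves the caption describes. This dense version stores the full $n \times n$ similarity matrix, so it is suitable only for small pools.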