An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models
Gantavya Bhatt, Yifang Chen, Arnav M. Das, Jifan Zhang, Sang T. Truong, Stephen Mussmann, Yinglun Zhu, Jeffrey Bilmes, Simon S. Du, Kevin Jamieson, Jordan T. Ash, Robert D. Nowak
TL;DR
The paper tackles the rising annotation burden in supervised finetuning of large language models by introducing an experimental-design framework that selects an informative, fixed set of prompts to annotate in one shot, avoiding the iterative retraining costs of active learning. It develops and evaluates uncertainty-based, k-center, and submodular selection strategies, including novel scores such as maximum token uncertainty and a facility-location-based approach, and reports substantial label-efficiency gains. On a 7B-scale LLaMA-2 finetuned with LoRA over a pool of 99K FLAN V2 prompts, the methods save roughly 50% of the annotation cost while maintaining or improving generalization on the MMLU and BBH benchmarks, and GPT-4-based evaluations corroborate the improvements. The approach promises scalable, cost-effective instruction tuning for domain-specific or compute-constrained settings. The authors acknowledge limitations, including biases introduced by subset selection, the model scale studied, and reliance on curated data.
Abstract
Supervised finetuning (SFT) on instruction datasets has played a crucial role in achieving the remarkable zero-shot generalization capabilities observed in modern large language models (LLMs). However, the annotation efforts required to produce high-quality responses for instructions are becoming prohibitively expensive, especially as the number of tasks spanned by instruction datasets continues to increase. Active learning is effective in identifying useful subsets of samples to annotate from an unlabeled pool, but its high computational cost remains a barrier to its widespread applicability in the context of LLMs. To mitigate the annotation cost of SFT and circumvent the computational bottlenecks of active learning, we propose using experimental design. Experimental design techniques select the most informative samples to label, and typically maximize some notion of uncertainty and/or diversity. In our work, we implement a framework that evaluates several existing and novel experimental design techniques and find that these methods consistently yield significant gains in label efficiency with little computational overhead. On generative tasks, our methods achieve the same generalization performance with only $50\%$ of the annotation cost required by random sampling.
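To make the idea of one-shot, diversity-driven selection concrete, here is a minimal sketch of greedy k-center selection, one of the strategy families named in the TL;DR. It is an illustration under stated assumptions and not the authors' implementation: it assumes every prompt in the unlabeled pool already has a fixed-size embedding, and the function name `k_center_greedy`, the `embeddings` array, the random choice of the first center, and the Euclidean metric are all hypothetical choices made for this example.

```python
import numpy as np


def k_center_greedy(embeddings: np.ndarray, budget: int, seed: int = 0) -> list[int]:
    """Greedily pick `budget` pool indices that cover the pool well.

    Each step adds the point that is farthest from its nearest
    already-selected point (the standard greedy k-center heuristic).
    `embeddings` is an (n, d) array of prompt features (an assumption).
    """
    n = embeddings.shape[0]
    if not 0 < budget <= n:
        raise ValueError("budget must be between 1 and the pool size")

    rng = np.random.default_rng(seed)
    first = int(rng.integers(n))  # arbitrary starting center
    selected = [first]

    # Distance from every pool point to its nearest selected center so far.
    min_dist = np.linalg.norm(embeddings - embeddings[first], axis=1)
    min_dist[first] = -1.0  # mark as taken so it is never picked again

    while len(selected) < budget:
        nxt = int(np.argmax(min_dist))  # least-covered remaining point
        selected.append(nxt)
        dist_to_new = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
        min_dist[nxt] = -1.0

    return selected
```

The returned indices would be the prompts sent for annotation in a single round, with no retraining between picks. Each iteration costs one pass over the pool, so the total work is on the order of budget × pool size × embedding dimension; an uncertainty-based score could be swapped in by replacing the distance criterion, though how the paper combines the two is not specified here.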
