Small Language Model as Data Prospector for Large Language Model
Shiwen Ni, Haihong Wu, Di Yang, Qiang Qu, Hamid Alinejad-Rokny, Min Yang
TL;DR
The paper addresses the data-quality bottleneck in instruction fine-tuning of large language models by introducing SuperNUGGETS, a data-prospecting framework that uses a small language model (SLM) to filter for high-quality one-shot examples and refines the predefined test set. It combines predefined task set refinement (Section 2.1) with SLM-based scoring (Section 2.2) to identify impactful instruction data efficiently, losing only 1-2% in performance relative to NUGGETS while running up to ~58x faster. Experiments on Alpaca with multiple prospectors show that fine-tuning on the top 5% of the data can outperform full-data fine-tuning, and that a refined 100-example test set significantly reduces computation without sacrificing performance. The findings suggest substantial practical value for scalable instruction tuning, while acknowledging limitations in scaling to multi-billion-parameter regimes and the need for further validation on larger models.
Abstract
The quality of instruction data directly affects the performance of fine-tuned Large Language Models (LLMs). Previously, \cite{li2023one} proposed \texttt{NUGGETS}, which selects high-quality data from a large dataset by identifying the individual instruction examples that, when learnt as one-shot instances, significantly improve performance across different tasks. In this work, we propose \texttt{SuperNUGGETS}, an improved variant of \texttt{NUGGETS} optimised for efficiency and performance. \texttt{SuperNUGGETS} uses a small language model (SLM) instead of a large language model (LLM) to filter the data for outstanding one-shot instances, and it refines the predefined test set. The experimental results show that the performance of \texttt{SuperNUGGETS} decreases by only 1-2% compared to \texttt{NUGGETS}, while its efficiency increases by a factor of 58. Compared to the original \texttt{NUGGETS}, \texttt{SuperNUGGETS} therefore has higher practical utility owing to its significantly lower resource consumption.
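The one-shot selection idea described above can be illustrated with a minimal sketch. It assumes that a prospector model has already produced, for each task in the predefined test set, a zero-shot loss and a loss obtained when a candidate instruction example is prepended as a one-shot demonstration. The function names `golden_score` and `select_top_fraction`, the use of loss as the comparison metric, and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Sequence


def golden_score(zero_shot_losses: Sequence[float],
                 one_shot_losses: Sequence[float]) -> float:
    """Fraction of predefined tasks on which the one-shot loss is lower.

    Assumption: a lower loss means better task performance, so a task
    counts as "improved" when the candidate example reduces its loss.
    """
    if len(zero_shot_losses) != len(one_shot_losses):
        raise ValueError("loss sequences must have the same length")
    if not zero_shot_losses:
        raise ValueError("at least one task is required")
    improved = sum(
        1 for z, o in zip(zero_shot_losses, one_shot_losses) if o < z
    )
    return improved / len(zero_shot_losses)


def select_top_fraction(scores: Dict[str, float], fraction: float) -> List[str]:
    """Return the ids of the highest-scoring examples (e.g. the top 5%)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, int(len(scores) * fraction))  # always keep at least one
    ranked = sorted(scores, key=lambda k: scores[k], reverse=True)
    return ranked[:keep]


# Toy usage: four predefined tasks, two hypothetical candidate examples.
zero_shot = [2.0, 1.5, 3.0, 1.0]
candidates = {
    "example_a": [1.8, 1.6, 2.5, 1.0],  # lowers the loss on 2 of 4 tasks
    "example_b": [1.9, 1.4, 2.9, 0.9],  # lowers the loss on 4 of 4 tasks
}
scores = {name: golden_score(zero_shot, losses)
          for name, losses in candidates.items()}
selected = select_top_fraction(scores, fraction=0.5)
```

In this sketch the scoring cost scales with the number of candidate examples times the number of test tasks times the cost of one prospector forward pass, which is why a smaller prospector and a smaller test set both reduce computation.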
