Speculative Coreset Selection for Task-Specific Fine-tuning
Xiaoyu Zhang, Juan Zhai, Shiqing Ma, Chao Shen, Tianlin Li, Weipeng Jiang, Yang Liu
TL;DR
Task-specific fine-tuning of large language models incurs high computational cost and uses data inefficiently. STAFF addresses this with a two-stage speculative coreset selection. First, a small model from the same family as the target LLM cheaply estimates per-sample importance using the effort-based score $S_d^s = \left\| \nabla_{\phi} L(\theta_s(d)) \right\|_2$. Second, a verification stage on the target LLM stratifies the data into $K$ regions and allocates a region-wise budget using $\mathcal{V}_i = \dfrac{\sum_{d \in B_i^*} S_d^t}{\sum_{d \in B_i^*} S_d^s}$ and $m_B = \left\lfloor \dfrac{(m - |{\mathbb{D}}'|) \mathcal{V}_i}{|\mathcal{B}|} \right\rfloor$. This approach balances data importance and diversity, improving on SOTA methods by up to 54.3% and reducing selection overhead by up to 70.5% across pruning rates; coresets selected at low pruning rates (e.g., 20%) sometimes outperform the full dataset. The method is validated on three LLMs and three downstream tasks, using a small model from the same family as the target and LoRA fine-tuning, and its code is released for reproducibility.
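As a rough illustration of the two verification-stage formulas above, the following Python sketch computes the verification ratio $\mathcal{V}_i$ and the region budget $m_B$ from precomputed scores. The function names and argument names are hypothetical and are not taken from the released code. The sketch also assumes that $|{\mathbb{D}}'|$ is the number of samples already selected and that $|\mathcal{B}|$ is the number of regions still awaiting a budget; both readings are assumptions, since the TL;DR does not define these symbols.

```python
import math
from typing import Sequence


def verification_score(target_scores: Sequence[float],
                       small_scores: Sequence[float]) -> float:
    """V_i: ratio of summed target-model scores S_d^t to summed
    small-model scores S_d^s over the verification samples B_i^*."""
    denom = sum(small_scores)
    if denom == 0:
        raise ValueError("small-model scores sum to zero; ratio is undefined")
    return sum(target_scores) / denom


def region_budget(m: int, num_selected: int, v_i: float,
                  num_remaining_regions: int) -> int:
    """m_B = floor((m - |D'|) * V_i / |B|).

    Assumption: num_selected plays the role of |D'| and
    num_remaining_regions plays the role of |B|.
    """
    if num_remaining_regions <= 0:
        raise ValueError("at least one region must remain")
    return math.floor((m - num_selected) * v_i / num_remaining_regions)


# Hypothetical scores for the verification samples of one region.
v = verification_score(target_scores=[2.0, 4.0], small_scores=[1.0, 3.0])
budget = region_budget(m=100, num_selected=20, v_i=v, num_remaining_regions=4)
```

Under these assumptions, a ratio above 1 means the target LLM rates the region as more important than the small model did, so the region receives more than an equal share of the remaining budget. The sketch does not cap the budget at the region's size, which a full implementation would presumably need to handle.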
Abstract
Task-specific fine-tuning is essential for the deployment of large language models (LLMs), but it requires significant computational resources and time. Existing work has proposed coreset selection methods to improve data efficiency and reduce training overhead, but these methods still have two limitations: (1) they overlook valuable samples at high pruning rates, which degrades the coreset's performance, and (2) they incur high time overhead during coreset selection because they must fine-tune and evaluate the target LLM. In this paper, we introduce STAFF, a speculative coreset selection method. STAFF leverages a small model from the same family as the target LLM to efficiently estimate data scores, and then verifies the scores on the target LLM to accurately identify important regions and allocate more selection budget to them while maintaining coverage of easy regions. We evaluate STAFF on three LLMs and three downstream tasks and show that STAFF improves the performance of SOTA methods by up to 54.3% and reduces selection overhead by up to 70.5% at different pruning rates. Furthermore, we observe that the coreset selected by STAFF at low pruning rates (i.e., 20%) can even achieve better fine-tuning performance than the full dataset.
