Speculative Coreset Selection for Task-Specific Fine-tuning

Xiaoyu Zhang; Juan Zhai; Shiqing Ma; Chao Shen; Tianlin Li; Weipeng Jiang; Yang Liu

TL;DR

Task-specific fine-tuning of large language models is computationally expensive and data-inefficient. STAFF addresses this with two-stage speculative coreset selection. First, a small model from the same family as the target LLM cheaply estimates per-sample importance using the effort-based score $S_d^s = \left\| \nabla_{\phi} L(\theta_s(d)) \right\|_2$. Second, a verification stage on the target LLM stratifies the data into $K$ regions and allocates a region-wise budget using $\mathcal{V}_i = \dfrac{\sum_{d \in B_i^*} S_d^t}{\sum_{d \in B_i^*} S_d^s}$ and $m_B = \left\lfloor \dfrac{(m - |{\mathbb{D}}'|) \mathcal{V}_i}{|\mathcal{B}|} \right\rfloor$. This approach balances data importance and diversity, improving on SOTA methods by up to 54.3% and reducing selection overhead by up to 70.5% across pruning rates; coresets at low pruning rates (e.g., 20%) sometimes outperform the full dataset. The method is evaluated on three LLMs and three downstream tasks, using a small model from the same family as each target LLM and LoRA fine-tuning, and its code is released for reproducibility.
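The verification step can be illustrated with a minimal Python sketch. It is not the released STAFF code. It assumes that $S_d^t$ denotes the target-model score of sample $d$ (read from the notation), that $|\mathcal{B}|$ counts the regions still awaiting a budget, and that regions are formed by equal-width binning of the small-model scores; the function names `stratify`, `verification_ratio`, and `region_budget` are hypothetical.

```python
import math


def stratify(scores, k):
    """Split sample indices into k equal-width score regions (binning rule is an assumption)."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / k or 1.0  # fall back to 1.0 when all scores are identical
    regions = [[] for _ in range(k)]
    for idx, score in enumerate(scores):
        region = min(int((score - lo) / width), k - 1)  # clamp the maximum into the last bin
        regions[region].append(idx)
    return regions


def verification_ratio(target_scores, spec_scores):
    """V_i: sum of target-model scores over sum of small-model scores on the verified subset B_i^*."""
    denom = sum(spec_scores)
    if denom == 0:
        raise ValueError("speculative scores sum to zero; ratio is undefined")
    return sum(target_scores) / denom


def region_budget(m, num_selected, v_i, num_regions_left):
    """m_B = floor((m - |D'|) * V_i / |B|), following the formula in the TL;DR."""
    return math.floor((m - num_selected) * v_i / num_regions_left)


# Toy usage with made-up scores (illustrative values only, not from the paper).
regions = stratify([0.0, 1.0, 2.0, 3.0], 2)
v = verification_ratio([2.0, 4.0], [1.0, 2.0])
budget = region_budget(m=100, num_selected=20, v_i=v, num_regions_left=4)
```

A ratio above 1 means the target LLM scores the region higher than the small model did, so the region receives a larger share of the remaining budget; a ratio below 1 shrinks its share.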

Abstract

Task-specific fine-tuning is essential for the deployment of large language models (LLMs), but it requires significant computational resources and time. Existing solutions have proposed coreset selection methods to improve data efficiency and reduce model training overhead, but they still have limitations: 1) Overlooking valuable samples at high pruning rates, which degrades the coreset's performance. 2) Requiring high time overhead during coreset selection to fine-tune and evaluate the target LLM. In this paper, we introduce STAFF, a speculative coreset selection method. STAFF leverages a small model from the same family as the target LLM to efficiently estimate data scores and then verifies the scores on the target LLM to accurately identify and allocate more selection budget to important regions while maintaining coverage of easy regions. We evaluate STAFF on three LLMs and three downstream tasks and show that STAFF improves the performance of SOTA methods by up to 54.3% and reduces selection overhead by up to 70.5% at different pruning rates. Furthermore, we observe that the coreset selected by STAFF at low pruning rates (i.e., 20%) can even obtain better fine-tuning performance than the full dataset.


Paper Structure (18 sections, 5 equations, 4 figures, 14 tables, 1 algorithm)

Introduction
Preliminaries
Coreset selection for Task-specific Fine-tuning
Speculative Execution
Methodology
Speculative score calculation
LLM verification & selection
Experiment
Experiment Setup
Comparison with baselines
Ablation Study
Conclusion
Appendix
Datasets
Baselines
...and 3 more sections

Figures (4)

Figure 1: Experimental results on the WMT-19 dataset with the Gemma-7b model. a) shows the effectiveness of STAFF in coreset selection across different pruning rates. b) shows the low selection overhead of STAFF.
Figure 2: Speculative execution uses the speculative task $Spec(\cdot)$ to speed up the upcoming task $T$.
Figure 3: Overview of STAFF. In LLM verification and selection, STAFF a) verifies the scores of different data regions on the target LLM and then b) adjusts the selection budget based on the difference between the speculative score and the verification score on the target LLM (e.g., 'Region A&B') to cover data regions that are important to the target LLM.
Figure 4: The data score distribution of different models on the WMT-19 dataset. a) The data scores are highly similar across models in the same family (e.g., Gemma-7b and Gemma-2b). b) There are significant differences in score distributions across models from different families (e.g., Gemma-7b and LLaMA-160M (Miao et al., 2024)).
