Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

Feiyang Kang, Hoang Anh Just, Yifan Sun, Himanshu Jahagirdar, Yuanzhi Zhang, Rongxing Du, Anit Kumar Sahu, Ruoxi Jia

TL;DR

This work tackles the cost of adapting large language models to new tasks by selecting data from unlabeled open corpora in a principled way. It introduces GOT-D, a scalable data-selection method that uses gradients of Optimal Transport (OT) to shift the pre-training distribution toward the target task, and it formalizes the notion of an effective data distribution during light fine-tuning. The authors provide a theoretical framework showing that, in low-data regimes, the optimal subset minimizes an OT-based surrogate that bounds the downstream loss, and they demonstrate practical efficiency by scaling to millions of samples within a single GPU hour. Empirically, GOT-D improves performance across detoxification, domain-specific NLU, and GLUE-like benchmarks, reducing toxicity with only modest losses in general utility. The code is open-sourced, which supports use in resource-constrained settings where rapid, data-efficient fine-tuning matters.

Abstract

This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired performance levels. While many data selection algorithms have been designed for small-scale applications, rendering them unsuitable for our context, some emerging methods do cater to language data scales. However, they often prioritize data that aligns with the target distribution. While this strategy may be effective when training a model from scratch, it can yield limited results when the model has already been pre-trained on a different distribution. Differing from prior work, our key idea is to select data that nudges the pre-training distribution closer to the target distribution. We show the optimality of this approach for fine-tuning tasks under certain conditions. We demonstrate the efficacy of our methodology across a diverse array of tasks (NLU, NLG, zero-shot) with models up to 2.7B, showing that it consistently surpasses other selection methods. Moreover, our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour. Our code is open-sourced (Code repository: https://anonymous.4open.science/r/DV4LLM-D761/ ). While fine-tuning offers significant potential for enhancing performance across diverse tasks, its associated costs often limit its widespread adoption; with this work, we hope to lay the groundwork for cost-effective fine-tuning, making its benefits more accessible.

Paper Structure (12 sections, 2 theorems, 3 equations, 3 figures, 2 tables)

Key Result

Lemma 1

For a model $M^0$ pre-trained on $D_P$ with empirical loss minimization on loss $\mathcal{L}(D_P)$, when conducting light fine-tuning (i.e., for a single epoch or few epochs) on small data $D_U$ in a low-data regime where $N(D_U)\ll N(D_P)$, it equates to moving fine-tuned model $M^*(D_U)$ towards m

Figures (3)

  • Figure 1: Benefits of two-stage fine-tuning. All settings presented achieve the same task performance. Evaluation is performed on the CoLA dataset (Wang et al., 2018).
  • Figure 2: Data Selection Setting. Given a pretrained model trained on pretraining data (red), we select additional data (blue) to fine-tune the model for a target task. We divide fine-tuning into two parts: I. Pre-Fine-Tuning and II. Targeted Fine-Tuning. Since labeled target data (green) can be expensive to curate (II), we leverage large, open-source, unlabeled data to pre-fine-tune the model (I), which we call the candidate set. Thus, our goal becomes to select the best subset from the candidate set to best prepare the model for the target task for any limited selection budget.
  • Figure 3: Consider an LLM pre-trained on a large corpus of $99$% cat examples and $1$% dog examples, and a target task consisting of $50$% cat examples and $50$% dog examples. The model's relative lack of knowledge about dogs will be its performance bottleneck on the target task. Before deploying the LLM on the target task, we select samples from the pool of available data for a lightweight warm-up pre-fine-tuning stage that better prepares the model for the target task. Selecting data by matching the target distribution yields $50$% cat and $50$% dog examples, of which only the dog examples help. In low-data regimes, where the fine-tuning set is small, this loss of data efficiency prevents the model from achieving the best possible improvement. Our gradient-based selection instead picks $100$% dog examples, which best make up for the knowledge the model lacks. In this case, our approach doubles the data efficiency of fine-tuning, which translates into a larger performance gain on downstream tasks.
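
The cat/dog scenario in Figure 3 can be reproduced with a small toy sketch. It is not the authors' implementation: the source only states that the method relies on gradients of Optimal Transport, so the code below swaps in a simplification that holds for this two-class case. With a 0/1 cost between classes, the OT distance equals the total variation distance, whose gradient with respect to each class's mass has a closed form. The function names, the centring of the gradient, the candidate pool, and the budget of 10 are all assumptions made for illustration.

```python
from collections import Counter


def ot_gradient_01_cost(source, target):
    """Gradient of OT(source, target) w.r.t. the mass of each source class.

    Assumption: the cost is 0 within a class and 1 across classes, so OT
    reduces to total variation, 0.5 * sum_i |source_i - target_i|, and the
    (sub)gradient for class i is 0.5 * sign(source_i - target_i).
    The values are centred by subtracting their mean (illustrative choice).
    """
    raw = {}
    for label in source:
        diff = source[label] - target.get(label, 0.0)
        raw[label] = 0.5 * ((diff > 0) - (diff < 0))
    mean = sum(raw.values()) / len(raw)
    return {label: g - mean for label, g in raw.items()}


def select_by_gradient(pool, grads, budget):
    """Pick the candidates whose class has the most negative gradient,
    i.e. adding them reduces the distance to the target the most."""
    ranked = sorted(pool, key=lambda label: grads[label])
    return ranked[:budget]


def select_by_matching(pool, target, budget):
    """Baseline: pick candidates in proportion to the target distribution."""
    chosen = []
    for label, share in target.items():
        quota = round(budget * share)
        chosen.extend([x for x in pool if x == label][:quota])
    return chosen[:budget]


# Distributions from the Figure 3 caption.
pretrain = {"cat": 0.99, "dog": 0.01}
target = {"cat": 0.50, "dog": 0.50}

# Hypothetical candidate pool with plenty of both classes.
pool = ["cat"] * 100 + ["dog"] * 100

grads = ot_gradient_01_cost(pretrain, target)
picked = select_by_gradient(pool, grads, budget=10)
baseline = select_by_matching(pool, target, budget=10)

print("gradients:        ", grads)
print("gradient-based:   ", Counter(picked))
print("target-matching:  ", Counter(baseline))
```

In this toy setting the dog class receives a negative gradient and the cat class a positive one, so the gradient-based rule spends the whole budget on dog examples, while the matching baseline spends half of it on cat examples the model already covers well. Real text data would need a cost defined over sample embeddings and a numerical OT solver in place of the closed form.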

Theorems & Definitions (3)

  • Lemma 1: Effective data distribution for fine-tuned model
  • Theorem 1: Optimal data selection for fine-tuning a pre-trained model in low-data regime
  • Remark 1