SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models

Yu Yang, Siddhartha Mishra, Jeffrey N Chiang, Baharan Mirzasoleiman

TL;DR

S2L is a scalable data-selection method for supervised fine-tuning (SFT) of LLMs in specialized domains. It records the loss trajectory of each training example on a small proxy model, clusters these trajectories, and samples uniformly from the clusters to form a representative subset, with theoretical guarantees on gradient similarity and convergence. Empirically, S2L delivers substantial data-efficiency gains on mathematical reasoning and clinical text summarization, and subsets selected with a small proxy transfer to larger target models, reducing both data and compute costs. The approach is robust to hyperparameters and proxy choice, but its evaluation is limited to two domains and assumes fixed training schedules, which leaves room for broader validation and optimization.
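
The select-by-trajectory idea can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `kmeans` and `select_subset`, the plain Lloyd's k-means, and the smallest-cluster-first quota rule are assumptions chosen only to show one way of clustering per-example loss trajectories and drawing a balanced sample from the clusters.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns a cluster id per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for idx, p in enumerate(points):
            assign[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def select_subset(trajectories, k, budget, seed=0):
    """Pick at most `budget` example indices, spread evenly over trajectory clusters.

    `trajectories[i]` is assumed to hold the losses of example i recorded at a
    few checkpoints while training a small proxy model.
    """
    rng = random.Random(seed)
    assign = kmeans(trajectories, k, seed=seed)
    clusters = {}
    for idx, c in enumerate(assign):
        clusters.setdefault(c, []).append(idx)
    selected = []
    # Visit small clusters first so their unused quota flows to larger ones.
    ordered = sorted(clusters.values(), key=len)
    for pos, members in enumerate(ordered):
        quota = (budget - len(selected)) // (len(ordered) - pos)
        selected.extend(rng.sample(members, min(quota, len(members))))
    return sorted(selected)


# Toy data: 20 examples whose loss drops during training, 4 whose loss stays flat.
learned = [[2.0, 1.0 + 0.01 * i, 0.5] for i in range(20)]
stuck = [[2.0, 2.0 + 0.01 * i, 2.0] for i in range(4)]
subset = select_subset(learned + stuck, k=2, budget=6)
```

In this toy setting the small "flat loss" cluster contributes as many examples as the large one, which is the point of sampling per cluster rather than over the whole dataset.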

Abstract

Despite the effectiveness of data selection for large language models (LLMs) during pretraining and instruction fine-tuning phases, improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which leverages training trajectories from small models to guide the data selection for larger models. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data to just 11% of the original MathInstruct dataset (Yue et al., 2023) to match full dataset performance while outperforming state-of-the-art data selection algorithms by an average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably, selecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most challenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset (Johnson et al., 2016), S2L again outperforms training on the full dataset using only 50% of the data. Notably, S2L can perform data selection using a reference model 40x smaller than the target model, proportionally reducing the cost of data selection.

Paper Structure (39 sections, 2 theorems, 24 equations, 18 figures, 5 tables, 1 algorithm)

Key Result

Theorem 4.1

If examples $i$ and $j$ have similar loss trajectories on the proxy model, i.e., $\| \mathbf{L}_i^{\text{proxy}} - \mathbf{L}_j^{\text{proxy}} \| \leq \epsilon$, and their loss trajectories on the proxy and target models are similar, i.e., $\| \mathbf{L}_p^{\text{proxy}} - \mathbf{L}_p^{\text{target}} \| \leq \delta$ for $p \in \{i, j\}$, then $i$ and $j$ have similar gradients throughout training of the target model [...], where $\epsilon' = \epsilon + 2\delta$ and $\|\bm{\theta}\| \leq D$ for all $t$.
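
Under the reading that the second condition bounds the proxy-target gap by $\delta$ for both $p = i$ and $p = j$ (an assumption made for this illustration), the constant $\epsilon'$ is what the triangle inequality yields for the target-model trajectories. A short derivation under that assumption:

$$
\begin{aligned}
\| \mathbf{L}_i^{\text{target}} - \mathbf{L}_j^{\text{target}} \|
&\leq \| \mathbf{L}_i^{\text{target}} - \mathbf{L}_i^{\text{proxy}} \|
 + \| \mathbf{L}_i^{\text{proxy}} - \mathbf{L}_j^{\text{proxy}} \|
 + \| \mathbf{L}_j^{\text{proxy}} - \mathbf{L}_j^{\text{target}} \| \\
&\leq \delta + \epsilon + \delta = \epsilon + 2\delta = \epsilon'.
\end{aligned}
$$

In words, two examples that are close on the proxy model remain close on the target model, with the slack growing by twice the proxy-target mismatch.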

Figures (18)

  • Figure 1: Existing data selection methods depend heavily on the feature representations from a reference model, which makes their effectiveness vulnerable to the quality of training on the target domain (Marion et al., 2023). For supervised fine-tuning (SFT), while pretrained models can effectively separate topics (shown in different colors) in natural language (\ref{fig:pretrain-pile}), they struggle with fine-tuning data that deviates from the pretraining distribution (\ref{fig:pretrain-math}). Additionally, the cost of training a reference model escalates with model size (\ref{fig:training-time}), making existing data selection methods for large models prohibitively expensive.
  • Figure 2: Examples in the same clusters have very similar loss trajectories (\ref{fig:same-cluster}) while the loss trajectories of examples in different clusters are very different (\ref{fig:diff-cluster}).
  • Figure 4: Data Scaling: Accuracies ($\uparrow$) on in-domain and out-of-domain datasets using Pythia-410M. Data size refers to the total number of unique training examples used. All experiments in this figure share the same total training steps and learning rate schedule (see \ref{sec:math-train}). See breakdowns in \ref{fig:410m-full}.
  • Figure 5: Wall-clock time required to train the reference model and select 100K data from MathInstruct for training Pythia-410M.
  • Figure 6: Distribution of the coverage of top-1 topic in each cluster. Taller bars on the right end of the plot indicate clusters with a higher concentration of a single topic and therefore suggest a grouping of similar examples.
  • ...and 13 more figures

Theorems & Definitions (3)

  • Theorem 4.1
  • Corollary 4.2
  • proof