
Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models

Mayee F. Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, Christopher Ré

TL;DR

This work introduces Skill-it, a data-driven framework that treats LM training as the acquisition of interdependent skills organized into an ordered skill set via a skills graph. It provides formal definitions for skills and their dependencies, demonstrates their existence in synthetic and real data, and proposes two data-selection methods—skill-stratified sampling and an online Skill-it algorithm—that leverage the graph to improve data efficiency. Across continual pre-training, fine-tuning, and out-of-domain evaluation, Skill-it yields substantial gains (e.g., LEGO accuracy improvements and reduced losses on Natural Instructions tasks) and shows robustness with larger models and diverse data sources. By connecting data selection to a principled skills-based representation of learning, the paper offers a path toward more data-efficient LM training and a framework for understanding how data shapes model capabilities.

Abstract

The quality of training data impacts the performance of pre-trained large language models (LMs). Given a fixed budget of tokens, we study how to best select data that leads to good downstream model performance across tasks. We develop a new framework based on a simple hypothesis: just as humans acquire interdependent skills in a deliberate order, language models also follow a natural order when learning a set of skills from their training data. If such an order exists, it can be utilized for improved understanding of LMs and for data-efficient training. Using this intuition, our framework formalizes the notion of a skill and of an ordered set of skills in terms of the associated data. First, using both synthetic and real data, we demonstrate that these ordered skill sets exist, and that their existence enables more advanced skills to be learned with less data when we train on their prerequisite skills. Second, using our proposed framework, we introduce an online data sampling algorithm, Skill-It, over mixtures of skills for both continual pre-training and fine-tuning regimes, where the objective is to efficiently learn multiple skills in the former and an individual skill in the latter. On the LEGO synthetic in the continual pre-training setting, Skill-It obtains 36.5 points higher accuracy than random sampling. On the Natural Instructions dataset in the fine-tuning setting, Skill-It reduces the validation loss on the target skill by 13.6% versus training on data associated with the target skill itself. We apply our skills framework on the recent RedPajama dataset to continually pre-train a 3B-parameter LM, achieving higher accuracy on the LM Evaluation Harness with 1B tokens than the baseline approach of sampling uniformly over data sources with 3B tokens.
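As a rough illustration of what online sampling over a mixture of skills can look like, the sketch below applies a multiplicative-weights style update to per-skill sampling proportions. The abstract does not state the update rule, so this is an assumption for illustration and not necessarily the Skill-It rule; the function name `update_mixture`, the step size `lr`, and the toy graph and loss values are all hypothetical.

```python
import math


def update_mixture(weights, losses, graph, lr=0.5):
    """One multiplicative-weights step over skill sampling proportions.

    weights: current sampling proportion per training skill (sums to 1)
    losses:  current validation loss per evaluation skill
    graph:   graph[i][j] >= 0 is an assumed influence of training on
             skill i on the loss of skill j (a skills-graph edge weight)
    lr:      hypothetical step size
    """
    scores = []
    for i, w in enumerate(weights):
        # Upweight a skill when the skills it influences still have high loss.
        signal = sum(graph[i][j] * losses[j] for j in range(len(losses)))
        scores.append(w * math.exp(lr * signal))
    total = sum(scores)
    # Renormalize so the result is again a probability distribution.
    return [s / total for s in scores]


# Toy example (made-up numbers): skill 0 is a prerequisite for skill 1.
graph = [[1.0, 0.8], [0.0, 1.0]]
weights = [0.5, 0.5]
losses = [1.0, 1.0]
new_weights = update_mixture(weights, losses, graph)
```

In this toy setup both skills have the same loss, but skill 0 also influences skill 1, so its sampling proportion grows relative to skill 1 after the update.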

Paper Structure

This paper contains 47 sections, 13 equations, 33 figures, 10 tables, 3 algorithms.

Figures (33)

  • Figure 1: Inspired by how humans acquire knowledge, we hypothesize that LMs best learn skills in a particular order and that this can help improve our understanding and training of LMs. We show that these ordered skill sets exist in real data, which enables skills to be learned with less data given that we train on their prerequisite skills. We then propose Skill-it, an online data selection algorithm that learns skills quickly by exploiting their ordering.
  • Figure 2: Heatmaps of adjacency matrices we compute for skill graphs for Alpaca, Pile of Law, and Natural Instructions. Negative elements and diagonals are thresholded to $0$ for clarity. See the appendix on skill graphs for descriptions of how they were constructed and for larger versions.
  • Figure 3: On the LEGO synthetic, 3-digit addition, and Natural Instructions, we identify examples of ordered skill sets in which training on a mixture of skills helps learn an individual skill faster than just training on that skill itself, given a fixed training budget.
  • Figure 4: Performance of Skill-it on each skill in the continual pre-training setting (learning over all skills in the ordered training skill set) on the LEGO synthetic (left) and addition synthetic (right).
  • Figure 5: Performance of Skill-it in the fine-tuning setting (learning a target skill using the ordered training skill set) on LEGO, addition, and NI.
  • ...and 28 more figures
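The thresholding mentioned in the Figure 2 caption (negative elements and diagonals set to 0) can be sketched as follows. This is a minimal illustration on a made-up matrix, not the authors' plotting code; the function name `threshold_for_display` and the values in `raw` are hypothetical.

```python
def threshold_for_display(adjacency):
    """Return a copy of a square matrix with the diagonal and all
    negative entries set to 0, leaving other entries unchanged."""
    n = len(adjacency)
    return [
        [0.0 if i == j or adjacency[i][j] < 0 else adjacency[i][j]
         for j in range(n)]
        for i in range(n)
    ]


# Made-up 3-skill adjacency matrix, for illustration only.
raw = [
    [0.9, 0.3, -0.2],
    [-0.1, 0.7, 0.5],
    [0.0, -0.4, 0.8],
]
shown = threshold_for_display(raw)
```

Only the positive off-diagonal entries survive, which is what makes the edges between distinct skills easy to read in a heatmap.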

Theorems & Definitions (2)

  • Definition 2.1: Skill
  • Definition 2.2: Ordered skill set, skills graph