Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models
Mayee F. Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, Christopher Ré
TL;DR
This work introduces Skill-It, a data-driven framework that treats LM training as the acquisition of interdependent skills, with the dependencies between skills encoded in a skills graph. It formally defines skills and ordered skill sets, demonstrates that they exist in synthetic and real data, and proposes two data-selection methods that use the graph to improve data efficiency: skill-stratified sampling and the online Skill-It algorithm. In continual pre-training on the LEGO synthetic dataset, Skill-It obtains 36.5 points higher accuracy than random sampling; in fine-tuning on Natural Instructions, it reduces validation loss on the target skill by 13.6% compared with training only on that skill's data. The approach also carries over to a larger model and more diverse data: a 3B-parameter LM continually pre-trained on 1B RedPajama tokens selected with the skills framework scores higher on the LM Evaluation Harness than the same model trained on 3B tokens sampled uniformly over data sources. By connecting data selection to a skills-based representation of learning, the paper offers a path toward more data-efficient LM training and a framework for understanding how data shapes model capabilities.
Abstract
The quality of training data impacts the performance of pre-trained large language models (LMs). Given a fixed budget of tokens, we study how to best select data that leads to good downstream model performance across tasks. We develop a new framework based on a simple hypothesis: just as humans acquire interdependent skills in a deliberate order, language models also follow a natural order when learning a set of skills from their training data. If such an order exists, it can be utilized for improved understanding of LMs and for data-efficient training. Using this intuition, our framework formalizes the notion of a skill and of an ordered set of skills in terms of the associated data. First, using both synthetic and real data, we demonstrate that these ordered skill sets exist, and that their existence enables more advanced skills to be learned with less data when we train on their prerequisite skills. Second, using our proposed framework, we introduce an online data sampling algorithm, Skill-It, over mixtures of skills for both continual pre-training and fine-tuning regimes, where the objective is to efficiently learn multiple skills in the former and an individual skill in the latter. On the LEGO synthetic dataset in the continual pre-training setting, Skill-It obtains 36.5 points higher accuracy than random sampling. On the Natural Instructions dataset in the fine-tuning setting, Skill-It reduces the validation loss on the target skill by 13.6% versus training on data associated with the target skill itself. We apply our skills framework on the recent RedPajama dataset to continually pre-train a 3B-parameter LM, achieving higher accuracy on the LM Evaluation Harness with 1B tokens than the baseline approach of sampling uniformly over data sources with 3B tokens.
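
To make the idea of online sampling over a mixture of skills concrete, the following is a minimal sketch, not the paper's algorithm as stated in this section. It assumes a multiplicative-weights-style update in which a skill's sampling proportion grows when the skills it feeds into (according to a skills graph) still have high validation loss. The function names `update_mixture` and `sample_skills`, the step size `eta`, and the convention that `graph[i][j] > 0` means "training on skill i helps skill j" are all illustrative assumptions.

```python
import math
import random


def update_mixture(weights, losses, graph, eta=0.5):
    """One assumed multiplicative-weights-style step over skill proportions.

    weights: current sampling proportion per skill (should sum to 1)
    losses:  current validation loss per skill
    graph:   graph[i][j] > 0 means training on skill i helps skill j
             (hypothetical encoding of the skills graph)
    eta:     step size (hypothetical default)
    """
    k = len(weights)
    scores = [
        # Upweight skill i in proportion to the loss remaining on the
        # skills it influences.
        weights[i] * math.exp(eta * sum(graph[i][j] * losses[j] for j in range(k)))
        for i in range(k)
    ]
    total = sum(scores)
    return [s / total for s in scores]  # renormalize to a distribution


def sample_skills(weights, n, seed=0):
    """Draw n skill indices according to the current mixture."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=n)
```

In a training loop under these assumptions, one would alternate between drawing a batch of skill indices with `sample_skills`, training on data from those skills, re-measuring per-skill validation loss, and calling `update_mixture` again.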
