MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

Zichun Yu, Spandan Das, Chenyan Xiong

TL;DR

This paper introduces model-aware data selection with data influence models (MATES), where a data influence model continuously adapts to the evolving data preferences of the pretraining model and then selects the data most effective for the current pretraining progress.

Abstract

Pretraining data selection has the potential to improve language model pretraining efficiency by utilizing higher-quality data from massive web data corpora. Current data selection methods, which rely on either hand-crafted rules or larger reference models, are conducted statically and do not capture the evolving data preferences during pretraining. In this paper, we introduce model-aware data selection with data influence models (MATES), where a data influence model continuously adapts to the evolving data preferences of the pretraining model and then selects the data most effective for the current pretraining progress. Specifically, we collect oracle data influence by locally probing the pretraining model and fine-tune a small data influence model to approximate it accurately. The data influence model then predicts data influence over the whole pretraining corpus and selects the most influential data for the next pretraining stage. Experiments on pretraining 410M and 1B models on the C4 dataset demonstrate that MATES significantly outperforms random data selection on an extensive set of downstream tasks. It doubles the gains achieved by the state-of-the-art data selection approach that leverages larger reference models and halves the total FLOPs required to reach a given level of performance. Further analyses validate the effectiveness of the locally probed oracle data influence and the approximation with data influence models. Our code is open-sourced at https://github.com/cxcscmu/MATES.
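
The probe, fit, predict, and select loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The sketch assumes a one-parameter regression model y = w * x in place of the language model, a two-feature least-squares fit in place of the fine-tuned data influence model, and an oracle influence defined as the drop in reference loss after a single SGD step on one data point. All function names, features, and hyperparameters are hypothetical.

```python
import random


def ref_loss(w, ref):
    """Mean squared error of the toy model y = w * x on the reference set."""
    return sum((w * x - y) ** 2 for x, y in ref) / len(ref)


def probe_influence(w, point, ref, lr=0.05):
    """Oracle influence (assumed definition): reference-loss drop after one
    SGD step on a single point, starting from the current model state w."""
    x, y = point
    w_next = w - lr * 2.0 * (w * x - y) * x
    return ref_loss(w, ref) - ref_loss(w_next, ref)


def features(point):
    """Hypothetical hand-picked features fed to the toy influence model."""
    x, y = point
    return (x * x, x * y)


def fit_influence_model(points, targets, ridge=1e-8):
    """Two-feature ridge least squares, solved in closed form (Cramer's rule)."""
    feats = [features(p) for p in points]
    a11 = sum(f1 * f1 for f1, _ in feats) + ridge
    a12 = sum(f1 * f2 for f1, f2 in feats)
    a22 = sum(f2 * f2 for _, f2 in feats) + ridge
    b1 = sum(f[0] * t for f, t in zip(feats, targets))
    b2 = sum(f[1] * t for f, t in zip(feats, targets))
    det = a11 * a22 - a12 * a12
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det)


def select_top_k(scores, k):
    """Indices of the k highest predicted influences."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def mates_stage(w, corpus, ref, n_probe, k, rng):
    """One selection stage: probe a small sample, fit, score the corpus, pick k."""
    probe = rng.sample(corpus, n_probe)
    oracle = [probe_influence(w, p, ref) for p in probe]
    theta = fit_influence_model(probe, oracle)
    scores = [theta[0] * f1 + theta[1] * f2 for f1, f2 in map(features, corpus)]
    return select_top_k(scores, k)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy corpus for the target y = 2x: even indices are clean,
    # odd indices carry sign-flipped (harmful) labels.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
    corpus = [(x, 2.0 * x if i % 2 == 0 else -2.0 * x) for i, x in enumerate(xs)]
    ref = [(x, 2.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]
    w = 0.0
    for stage in range(3):
        # Re-probe at every stage so selection tracks the current model state.
        chosen = mates_stage(w, corpus, ref, n_probe=40, k=50, rng=rng)
        for i in chosen:
            x, y = corpus[i]
            w -= 0.05 * 2.0 * (w * x - y) * x
        print(stage, round(ref_loss(w, ref), 4))
```

Because the influence model is refit from fresh probes at each stage, the selection follows the model's changing preferences rather than a fixed, one-time ranking. Only the small probe sample is scored with the expensive oracle; the rest of the corpus is scored with the cheap fitted model.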

Paper Structure

This paper contains 30 sections, 16 equations, 9 figures, 7 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Correlation of locally probed data influences at different pretraining steps (a) and the zero-shot performance with model-aware data selection (b). The experiments are based on 1B models.
  • Figure 2: Overview of MATES. The language model is first pretrained with a random set of data. Then, a data influence model is trained to approximate data influences on the target performance of the pretraining model and select the most effective data for the next pretraining stage.
  • Figure 3: Downstream performance of 410M and 1B models w.r.t. pretraining FLOPs and steps. The data selection procedure of MATES only accounts for 21.7% and 11.5% of the total FLOPs for 410M and 1B models, respectively.
  • Figure 4: Oracle data influence distribution in the 410M setting with different reference tasks at 50k steps. MC: multiple choice. LM: language modeling. We also present the standard deviation of the distribution and the proportions of the data with positive/negative oracle data influence.
  • Figure 5: Static (based on a 10k or a 50k random-pretrained model checkpoint) data selection versus model-aware data selection in influence modeling and downstream accuracy.
  • ...and 4 more figures