MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

Zichun Yu; Spandan Das; Chenyan Xiong

TL;DR

This paper introduces model-aware data selection with data influence models (MATES), where a data influence model continuously adapts to the evolving data preferences of the pretraining model and then selects the data most effective for the current pretraining progress.

Abstract

Pretraining data selection has the potential to improve language model pretraining efficiency by utilizing higher-quality data from massive web data corpora. Current data selection methods, which rely on either hand-crafted rules or larger reference models, are static and do not capture the evolving data preferences during pretraining. In this paper, we introduce model-aware data selection with data influence models (MATES), where a data influence model continuously adapts to the evolving data preferences of the pretraining model and then selects the data most effective for the current pretraining progress. Specifically, we collect oracle data influence by locally probing the pretraining model and fine-tune a small data influence model to approximate it accurately. The data influence model then predicts data influence over the whole pretraining corpus and selects the most influential data for the next pretraining stage. Experiments on pretraining 410M and 1B models on the C4 dataset demonstrate that MATES significantly outperforms random data selection across an extensive set of downstream tasks. It doubles the gains achieved by the state-of-the-art data selection approach that leverages larger reference models and halves the total FLOPs required to reach a given level of performance. Further analyses validate the effectiveness of the locally probed oracle data influence and the approximation with data influence models. Our code is open-sourced at https://github.com/cxcscmu/MATES.
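The selection loop described in the abstract (probe a few examples, fit a small influence model, score the whole pool, keep the top examples) can be illustrated with a toy sketch. This is not the authors' implementation: the linear regression "pretraining model", the ridge regression standing in for the data influence model, the single-step probe, the sign convention (a positive score means the reference loss went down), and all function names are assumptions chosen to keep the example small and runnable.

```python
import numpy as np

def ref_loss(w, X_ref, y_ref):
    """Mean squared error of a linear model w on the reference set."""
    return float(np.mean((X_ref @ w - y_ref) ** 2))

def probe_influence(w, x, y, X_ref, y_ref, lr=0.01):
    """Assumed oracle influence: reference loss drop after one SGD step on (x, y)."""
    grad = 2.0 * (x @ w - y) * x          # gradient of the squared error on one example
    w_probe = w - lr * grad               # local probe; the real model w is left untouched
    return ref_loss(w, X_ref, y_ref) - ref_loss(w_probe, X_ref, y_ref)

def fit_influence_model(feats, targets, l2=1e-3):
    """Ridge regression as a hypothetical stand-in for the small influence model."""
    A = feats.T @ feats + l2 * np.eye(feats.shape[1])
    return np.linalg.solve(A, feats.T @ targets)

def select_top_k(scores, k):
    """Indices of the k highest predicted influence scores."""
    return np.argsort(-scores)[:k]

def select_stage(w, X_pool, y_pool, X_ref, y_ref, n_probe, k, rng):
    """One selection stage: probe a subset, fit the influence model, score the pool."""
    probe_idx = rng.choice(len(X_pool), size=n_probe, replace=False)
    oracle = np.array([
        probe_influence(w, X_pool[i], y_pool[i], X_ref, y_ref) for i in probe_idx
    ])
    # Toy features: the raw example, its label, and a bias term.
    feats = np.hstack([X_pool, y_pool[:, None], np.ones((len(X_pool), 1))])
    theta = fit_influence_model(feats[probe_idx], oracle)
    return select_top_k(feats @ theta, k)
```

In this sketch, calling `select_stage` again after further training on the selected examples would refit the influence model against the updated `w`, which is the sense in which the selection is model-aware rather than static.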

Paper Structure

This paper contains 30 sections, 16 equations, 9 figures, 7 tables, and 1 algorithm.

Introduction
Related work
Methods
Model-aware data selection framework
Locally probed oracle data influence
Experimental methodologies
Evaluation results
Overall performance
MATES outperforms the state-of-the-art data selection approach.
MATES selects the data with low costs.
MATES significantly elevates the scaling curves.
Effectiveness of locally probed oracle data influence
Effectiveness of data influence model
Case study
Discussion and limitations
...and 15 more sections

Figures (9)

Figure 1: Correlation of locally probed data influences at different pretraining steps (a) and the zero-shot performance with model-aware data selection (b). The experiments are based on 1B models.
Figure 2: Overview of MATES. The language model is first pretrained with a random set of data. Then, a data influence model is trained to approximate data influences on the target performance of the pretraining model and select the most effective data for the next pretraining stage.
Figure 3: Downstream performance of 410M and 1B models w.r.t. pretraining FLOPs and steps. The data selection procedure of MATES only accounts for 21.7% and 11.5% of the total FLOPs for 410M and 1B models, respectively.
Figure 4: Oracle data influence distribution in the 410M setting with different reference tasks at 50k steps. MC: multiple choice. LM: language modeling. We also present the standard deviation of the distribution and the proportions of the data with positive/negative oracle data influence.
Figure 5: Static (based on a 10k or a 50k random-pretrained model checkpoint) data selection versus model-aware data selection in influence modeling and downstream accuracy.
...and 4 more figures
