Analyzing Similarity Metrics for Data Selection for Language Model Pretraining

Dylan Sam, Ayan Chakrabarti, Afshin Rostamizadeh, Srikumar Ramalingam, Gui Citovsky, Sanjiv Kumar

TL;DR

This paper investigates how to measure similarity between pretraining examples for language model data curation. It proposes a three-part evaluation framework that tests how well embedding distances track generalization in pretraining loss, how useful the embeddings are for diversity-based data selection, and how well they separate data sources. Experiments on the Pile with a 1.7B-parameter decoder model show that off-the-shelf embeddings underperform simple, specialized embeddings derived from smaller models trained on the same data, with LM Output Embeds often yielding the best results. The findings favor task-aligned embeddings and provide a practical framework for developing embedding models tailored to pretraining data curation.

Abstract

Measuring similarity between training examples is critical for curating high-quality and diverse pretraining datasets for language models. However, similarity is typically computed with a generic off-the-shelf embedding model that has been trained for tasks such as retrieval. Whether these embedding-based similarity metrics are well-suited for pretraining data selection remains largely unexplored. In this paper, we propose a new framework to assess the suitability of a similarity metric specifically for data curation in language model pretraining applications. Our framework's first evaluation criterion captures how well distances reflect generalization in pretraining loss between different training examples. Next, we use each embedding model to guide a standard diversity-based data curation algorithm and measure its utility by pretraining a language model on the selected data and evaluating downstream task performance. Finally, we evaluate the capabilities of embeddings to distinguish between examples from different data sources. With these evaluations, we demonstrate that standard off-the-shelf embedding models are not well-suited for the pretraining data curation setting, underperforming even remarkably simple embeddings that are extracted from models trained on the same pretraining corpus. Our experiments are performed on the Pile, for pretraining a 1.7B parameter language model on 200B tokens. We believe our analysis and evaluation framework serve as a foundation for the future design of embeddings that specifically reason about similarity in pretraining datasets.
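
The first criterion, how well embedding distances reflect generalization in pretraining loss, can be illustrated with a small sketch. The abstract does not give the exact statistic, so the definition below is an assumption: it measures the fraction of total loss variance that disappears once examples are grouped by cluster, using a size-weighted within-cluster variance. The function name `variance_reduction` and the toy inputs are hypothetical and are not taken from the paper.

```python
import numpy as np


def variance_reduction(losses, cluster_ids):
    """Fraction of loss variance explained by cluster membership.

    Assumed definition (not confirmed by the source):
        1 - (size-weighted within-cluster variance) / (total variance)
    A value near 1 means examples in the same cluster have similar losses;
    a value near 0 means the clustering says little about the loss.
    """
    losses = np.asarray(losses, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    if losses.shape != cluster_ids.shape:
        raise ValueError("losses and cluster_ids must have the same shape")

    total_var = losses.var()
    if total_var == 0.0:
        # All losses are identical, so there is no variance to reduce.
        return 0.0

    within = 0.0
    for c in np.unique(cluster_ids):
        members = losses[cluster_ids == c]
        # Weight each cluster's variance by its number of members.
        within += members.size * members.var()
    within /= losses.size

    return 1.0 - within / total_var


# Toy example: two clusters that separate low-loss and high-loss examples.
toy_losses = [1.0, 1.0, 3.0, 3.0]
good_clusters = [0, 0, 1, 1]   # groups equal losses together
bad_clusters = [0, 1, 0, 1]    # mixes low and high losses
print(variance_reduction(toy_losses, good_clusters))
print(variance_reduction(toy_losses, bad_clusters))
```

Under this assumed definition, a clustering that groups examples with similar pretraining loss scores higher, which matches the "larger values are better" reading in the figure captions.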

Paper Structure

This paper contains 33 sections, 2 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: A visualization of the correlation between pretraining loss and embedding distance. Each row shows a pair of examples close in embedding space (from the same K-means cluster), with examples in different rows being far from each other (from different clusters). We find that close pairs of examples tend to have similar pretraining losses, while there is a greater variation in losses across clusters. Close example pairs are "thematically" similar but have different content. These results are from averaged embeddings from the final layer of a small decoder-only language model.
  • Figure 2: Variance reduction as we vary average cluster size. Larger values are better. Results are computed over 50 million sampled clusters from the Pile, where pretraining losses are computed after 26k gradient steps. Specialized embeddings yield higher variance reduction than off-the-shelf models for all cluster sizes.
  • Figure 3: Variance reduction as we increase the number of gradient steps in pretraining. Larger values are better. Results are computed over 50 million sampled clusters from the Pile with an average cluster size of 50. Benefits in variance reduction remain consistent throughout pretraining.
  • Figure 4: Comparison of cluster purity with respect to data source for K-means clusterings produced by various embedding models, averaged over 50 million clusters from the Pile. Specialized embedding models have higher cluster purity scores.
  • Figure 5: Ablation on the number of components in PCA for Gecko and LM Output Embeds. Results are averaged over 50 million sampled clusters from the Pile. Using more PCA components groups points with similar pretraining loss more effectively.
  • ...and 3 more figures
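
The cluster purity mentioned in the Figure 4 caption can be sketched as follows. This uses the standard textbook definition of purity (the share of examples that belong to the majority data source of their cluster); whether the paper uses exactly this weighting is an assumption. The function name `cluster_purity` and the source labels are hypothetical.

```python
from collections import Counter


def cluster_purity(cluster_ids, sources):
    """Standard clustering purity with respect to data-source labels.

    For each cluster, count the examples from its most common source,
    sum those counts over all clusters, and divide by the total number
    of examples. Returns a value in (0, 1]; 1 means every cluster
    contains examples from a single source.
    """
    if len(cluster_ids) != len(sources):
        raise ValueError("cluster_ids and sources must have the same length")
    if len(cluster_ids) == 0:
        raise ValueError("at least one example is required")

    # Map each cluster id to a histogram of the sources it contains.
    by_cluster = {}
    for cluster, source in zip(cluster_ids, sources):
        by_cluster.setdefault(cluster, Counter())[source] += 1

    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


# Toy example with made-up source labels.
toy_clusters = [0, 0, 0, 1, 1]
toy_sources = ["arxiv", "arxiv", "github", "github", "github"]
print(cluster_purity(toy_clusters, toy_sources))
```

In the toy example, cluster 0 has two of three examples from its majority source and cluster 1 has two of two, so four of the five examples agree with their cluster's majority source.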