SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training

Nan He, Weichen Xiong, Hanwen Liu, Yi Liao, Lei Ding, Kai Zhang, Guohua Tang, Xiao Han, Wei Yang

TL;DR

The paper tackles data duplication in large-scale language model pre-training by introducing SoftDedup, a data reweighting method based on data commonness. It computes a commonness score $p(x)$ from a 4-gram model with Kneser–Ney smoothing and sets sampling weights $W(x) \propto 1/p(x)$, so highly common samples are down-weighted rather than discarded. Empirical results show that SoftDedup reaches comparable perplexity with at least $26\%$ fewer training steps and improves average few-shot downstream accuracy by $1.77\%$ when trained for an equivalent duration. It also outperforms hard deduplication and complements existing deduplication pipelines. The approach is computationally efficient (CPU-based) and practical to integrate into standard pre-training workflows, offering a pathway to faster, more data-efficient LLM pre-training at scale.

Abstract

The effectiveness of large language models (LLMs) is often hindered by duplicated data in their extensive pre-training datasets. Current approaches primarily focus on detecting and removing duplicates, which risks the loss of valuable information and neglects the varying degrees of duplication. To address this, we propose a soft deduplication method that maintains dataset integrity while selectively reducing the sampling weight of data with high commonness. Central to our approach is the concept of "data commonness", a metric we introduce to quantify the degree of duplication by measuring the occurrence probabilities of samples using an n-gram model. Empirical analysis shows that this method significantly improves training efficiency, achieving comparable perplexity scores with at least a 26% reduction in required training steps. Additionally, it enhances average few-shot downstream accuracy by 1.77% when trained for an equivalent duration. Importantly, this approach consistently improves performance, even on rigorously deduplicated datasets, indicating its potential to complement existing methods and become a standard pre-training process for LLMs.
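
As a rough illustration of the "data commonness" idea in the abstract, the sketch below scores each document by its length-normalised probability under an n-gram model. It is not the authors' implementation: for brevity it swaps in a bigram model with add-k smoothing in place of a higher-order, better-smoothed model, and the whitespace tokeniser, the length normalisation, and the names `train_bigram` and `commonness` are assumptions made for this example.

```python
import math
from collections import Counter


def train_bigram(docs):
    """Count bigram contexts and bigrams over whitespace-tokenised documents."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for doc in docs:
        tokens = ["<s>"] + doc.split()
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, len(vocab)


def commonness(doc, model, k=1.0):
    """Geometric-mean token probability of `doc` under the add-k bigram model.

    Assumption: length normalisation is used so that long and short
    documents are comparable; the source only says "occurrence probability".
    """
    context_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + doc.split()
    if len(tokens) < 2:
        raise ValueError("cannot score an empty document")
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        num = bigram_counts[(prev, cur)] + k
        den = context_counts[prev] + k * vocab_size
        log_prob += math.log(num / den)
    return math.exp(log_prob / (len(tokens) - 1))


# Toy corpus: one boilerplate line repeated three times and one unique sentence.
corpus = [
    "click here to subscribe",
    "click here to subscribe",
    "click here to subscribe",
    "quantum annealing solves sparse ising models",
]
model = train_bigram(corpus)
scores = [commonness(doc, model) for doc in corpus]
```

In this toy corpus the repeated boilerplate line receives a higher score than the one-off sentence, which is the signal a soft deduplication scheme would use to lower its sampling weight.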

Paper Structure (27 sections, 4 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Hard deduplication versus soft deduplication. Hard deduplication identifies and removes duplicate samples. Soft deduplication identifies samples with high commonness, decreasing their sampling weight during training. Here, a sample refers to a document within the original corpus.
  • Figure 2: We aim to obtain a more balanced training set from a large raw dataset through data reweighting. Initially, we train an n-gram model using the raw dataset to calculate the commonness of each sample within the corpus. Following this, we partition the dataset and assign weights according to data commonness. Samples with higher commonness are assigned lower sampling weights, while those with lower commonness receive higher sampling weights. The weighted data is then used for the pre-training of a language model.
  • Figure 3: Performance evaluation results of models trained on the RedPajama CommonCrawl dataset. Separate panels show the average perplexity on the Pile test set, the average perplexity on the SlimPajama test set, and the average accuracy on various downstream tasks. Our method uses 20 data partitions and a 10-fold ratio between the maximum and minimum sampling weights. Baseline refers to direct training.
  • Figure 4: Performance evaluation results of models trained on the SlimPajama CommonCrawl dataset. Separate panels show the average perplexity on the Pile test set, the average perplexity on the SlimPajama test set, and the average accuracy on various downstream tasks. Our method uses 20 data partitions and a 10-fold ratio between the maximum and minimum sampling weights. Baseline refers to direct training.
  • Figure 5: Performance evaluation results of models trained on the Falcon RefinedWeb dataset.
  • ...and 2 more figures
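
The reweighting step described in the Figure 2 caption, with the settings quoted for Figures 3 and 4 (20 partitions, a 10-fold gap between the largest and smallest weight), could be sketched as follows. The captions do not say how weights vary between the two extremes, so the geometric spacing below is an assumption, as are the function name `partition_weights` and the use of `random.choices` for sampling.

```python
import random


def partition_weights(scores, num_partitions=20, max_min_ratio=10.0):
    """Map commonness scores to per-sample sampling weights that sum to 1.

    Samples are ranked by commonness and split into equal-sized partitions;
    the least common partition gets the largest weight. The geometric
    spacing between partitions is an assumption, not taken from the paper.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("scores must be non-empty")
    order = sorted(range(n), key=lambda i: scores[i])  # least common first
    weights = [0.0] * n
    for rank, idx in enumerate(order):
        part = rank * num_partitions // n  # partition index, 0 = least common
        frac = part / (num_partitions - 1) if num_partitions > 1 else 0.0
        weights[idx] = max_min_ratio ** (1.0 - frac)
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical commonness scores for 40 documents, in increasing order.
demo_scores = [i / 100 for i in range(1, 41)]
demo_weights = partition_weights(demo_scores)

# Draw a reweighted batch of document indices (with replacement).
rng = random.Random(0)
batch = rng.choices(range(len(demo_scores)), weights=demo_weights, k=8)
```

Because every document keeps a non-zero weight, nothing is removed from the corpus; highly common documents are simply drawn less often.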