Measuring Sample Importance in Data Pruning for Language Models based on Information Entropy

Minsang Kim, Seungjun Baek

TL;DR

The paper addresses the compute intensity of pretraining large language models by introducing an entropy-based data pruning strategy. It defines sample informativeness via $H(W,p)$ (entropy under the true distribution), $H(W,q)$ (autoregressive surprisal from a data probe), and $H(W,f)$ (inverse word-frequency surprisal), combining them as $H(W,q)+H(W,f)$ to rank and prune data. Experiments show the method yields better perplexity and downstream task performance than random or perplexity-only pruning, maintaining or improving results with substantial data pruning. This approach promises compute-efficient LLM training with robust generalization, and is argued to scale with model/data size under neural scaling laws.

Abstract

Compute-efficient training of language models has become an important issue. We consider data pruning for data-efficient training of LLMs. In this work, we consider a data pruning method based on information entropy. We propose that the samples in the training corpus be ranked in terms of their informativeness, which we estimate through entropy functions. The key idea is that less informative samples are likely to contain redundant information, and thus should be pruned first. We use the entropy functions based on the negative log-likelihood and the average inverse word frequency of a sample as a surrogate to measure its informativeness. Experiments reveal that the proposed information-based pruning can improve performance on various language modeling and downstream tasks, and enhance the generalization capability of language models.
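
The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes whitespace tokenization, non-empty samples, word frequencies estimated from the corpus itself, and per-token log-probabilities already produced by some probe model. The function names (`probe_entropy`, `word_frequency_entropy`, `prune`) and the `keep_ratio` parameter are hypothetical. Each sample is scored as the average negative log-likelihood plus the average negative log word frequency, which is one reading of $H(W,q)+H(W,f)$, and the lowest-scoring samples are dropped first.

```python
import math
from collections import Counter


def probe_entropy(token_logprobs):
    """Average negative log-likelihood of one sample under a probe model."""
    return -sum(token_logprobs) / len(token_logprobs)


def word_frequency_entropy(tokens, freq):
    """Average of -log f(w) over the tokens of one sample."""
    return -sum(math.log(freq[t]) for t in tokens) / len(tokens)


def prune(samples, probe_logprobs, keep_ratio):
    """Keep the highest-scoring fraction of samples, preserving corpus order."""
    tokenized = [s.split() for s in samples]  # assumption: whitespace tokens
    counts = Counter(t for toks in tokenized for t in toks)
    total = sum(counts.values())
    freq = {w: c / total for w, c in counts.items()}  # relative frequency

    scores = [
        probe_entropy(lp) + word_frequency_entropy(toks, freq)
        for toks, lp in zip(tokenized, probe_logprobs)
    ]
    n_keep = max(1, round(keep_ratio * len(samples)))
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore original ordering
    return [samples[i] for i in kept], scores


# Toy corpus: two redundant samples and one rarer one.
samples = ["the cat sat", "the cat sat", "quantum entropy bounds"]
# Hypothetical per-token log-probabilities from a probe model.
logprobs = [[-0.5, -0.5, -0.5], [-0.5, -0.5, -0.5], [-3.0, -3.0, -3.0]]
kept, scores = prune(samples, logprobs, keep_ratio=1 / 3)
print(kept)
```

In this toy corpus the duplicated sentence receives the lower score on both terms, so it is pruned before the rarer one. A real pipeline would obtain the log-probabilities from a trained probe language model and would likely use the model's own tokenizer rather than whitespace splitting.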

Paper Structure (16 sections, 5 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Left: Text classification results on CoLA. Right: Text similarity results on SST2. The black line indicates the performance without pruning.
  • Figure 2: Loss curve of GPT-125M over 1 sweep of the training dataset. For data efficiency, we stopped training the probe model at the point where the decrease in loss saturates, i.e., at about 12% of the entire dataset.
  • Figure 3: Left: Test loss of 125M GPT per training token. Right: Test loss of 345M GPT per training token.