Measuring Sample Importance in Data Pruning for Language Models based on Information Entropy
Minsang Kim, Seungjun Baek
TL;DR
The paper addresses the compute cost of pretraining large language models by introducing an entropy-based data pruning strategy. It defines sample informativeness through three entropy functions: $H(W,p)$ (entropy under the true distribution), $H(W,q)$ (autoregressive surprisal from a data probe), and $H(W,f)$ (inverse word-frequency surprisal). Samples are ranked by the combined score $H(W,q)+H(W,f)$, and the lowest-ranked samples are pruned. Experiments show the method yields better perplexity and downstream task performance than random or perplexity-only pruning, maintaining or improving results even with substantial data pruning. This approach promises compute-efficient LLM training with robust generalization, and is argued to scale with model and data size under neural scaling laws.
Abstract
Compute-efficient training of language models has become an important issue. In this work, we consider data pruning for data-efficient training of LLMs, using a method based on information entropy. We propose that the samples in the training corpus be ranked by their informativeness, which we estimate through entropy functions. The key idea is that less informative samples are likely to contain redundant information, and thus should be pruned first. As surrogates for a sample's informativeness, we use entropy functions based on its negative log-likelihood and its average inverse word frequency. Experiments reveal that the proposed information-based pruning can improve performance on various language modeling and downstream tasks, and enhance the generalization capability of language models.
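
The ranking idea described above can be illustrated with a minimal sketch. Everything in it beyond the two surrogates named in the abstract is an assumption made for illustration: the per-token averaging, the add-one-smoothed bigram model standing in for the data probe, the `keep_ratio` parameter, and all function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def word_frequencies(corpus):
    """Relative frequency of each token over the whole corpus."""
    counts = Counter(tok for sample in corpus for tok in sample)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def h_f(sample, freqs):
    """Average inverse word-frequency surprisal, -log f(w), per token (assumed form)."""
    return sum(-math.log(freqs[tok]) for tok in sample) / len(sample)

def h_q(sample, probe):
    """Average negative log-likelihood per token under a probe model."""
    probs = probe(sample)
    return sum(-math.log(p) for p in probs) / len(probs)

def make_bigram_probe(corpus):
    """Hypothetical stand-in for the data probe: add-one-smoothed bigram model."""
    vocab_size = len({tok for sample in corpus for tok in sample})
    pair_counts, prev_counts = Counter(), Counter()
    for sample in corpus:
        prev = "<s>"
        for tok in sample:
            pair_counts[(prev, tok)] += 1
            prev_counts[prev] += 1
            prev = tok

    def probe(sample):
        probs, prev = [], "<s>"
        for tok in sample:
            probs.append((pair_counts[(prev, tok)] + 1) / (prev_counts[prev] + vocab_size))
            prev = tok
        return probs

    return probe

def prune(corpus, probe, keep_ratio):
    """Keep the highest-scoring samples; low-score samples are pruned first."""
    freqs = word_frequencies(corpus)
    scores = [h_q(s, probe) + h_f(s, freqs) for s in corpus]
    n_keep = max(1, round(keep_ratio * len(corpus)))
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:n_keep])  # restore original corpus order
    return [corpus[i] for i in kept_indices], scores

# Toy corpus: three duplicated sentences and one rare one.
corpus = [
    "the cat sat on the mat".split(),
    "the cat sat on the mat".split(),
    "the cat sat on the mat".split(),
    "quantum annealing minimizes ising energy".split(),
]
probe = make_bigram_probe(corpus)
kept, scores = prune(corpus, probe, keep_ratio=0.5)
```

In this toy corpus the repeated sentence receives a lower combined score than the rare one, so the duplicates are the first candidates for pruning. A real setup would replace the bigram stand-in with a trained language model as the probe.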
