Entropy Law: The Story Behind Data Compression and LLM Performance

Mingjia Yin, Chuhan Wu, Yufei Wang, Hao Wang, Wei Guo, Yasheng Wang, Yong Liu, Ruiming Tang, Defu Lian, Enhong Chen

TL;DR

This work reframes data selection for LLM training through an entropy law that links model performance to the data compression ratio and first-epoch training loss: data with a lower compression ratio (less redundancy) whose loss remains manageable tends to yield better knowledge mastery. It introduces ZIP, a model-free, multi-stage greedy algorithm that minimizes the data compression ratio while preserving diversity, enabling efficient data selection for SFT and RLHF. Empirical results show that ZIP consistently outperforms quality-based baselines across multiple backbones and alignment stages. The entropy law is validated by the observed relationships among compression ratio, loss, and performance, and it has a practical use in detecting performance risks early during incremental data updates. The approach offers a scalable, compute-conscious way to improve LLM learning efficiency and provides a predictive criterion for data utility in alignment tasks.

Abstract

Data is the cornerstone of large language models (LLMs), but not all data is useful for model learning. Carefully selected data can better elicit the capabilities of LLMs with much less computational overhead. Most methods concentrate on evaluating the quality of individual samples in data selection, while the combinatorial effects among samples are neglected. Even if each sample is of perfect quality, their combinations may be suboptimal in teaching LLMs due to their intrinsic homogeneity or contradiction. In this paper, we aim to uncover the underlying relationships between LLM performance and data selection. Inspired by the information compression nature of LLMs, we uncover an "entropy law" that connects LLM performance with data compression ratio and first-epoch training loss, which reflect the information redundancy of a dataset and the mastery of inherent knowledge encoded in this dataset, respectively. Through both theoretical deduction and empirical evaluation, we find that model performance is negatively correlated with the compression ratio of training data, which usually yields a lower training loss. Based on the findings of the entropy law, we propose an efficient and universal data selection method named ZIP for training LLMs, which aims to prioritize data subsets exhibiting a low compression ratio. Based on a multi-stage algorithm that selects diverse data in a greedy manner, we can obtain a good data subset with satisfactory diversity. Extensive experiments have been conducted to validate the entropy law and the superiority of ZIP across different LLM backbones and alignment stages. We also present an interesting application of the entropy law that can detect potential performance risks at the beginning of model training.
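
To make the idea of compression-ratio-driven selection concrete, here is a minimal Python sketch. It is not the authors' ZIP implementation: the abstract describes ZIP only as a multi-stage greedy algorithm, whereas this is a single-stage greedy loop. It also assumes the compression ratio is the raw byte size divided by the zlib-compressed size, so a higher value means more redundancy. The function names `compression_ratio` and `greedy_select` are hypothetical and chosen for illustration.

```python
import zlib


def compression_ratio(texts):
    """Raw bytes divided by zlib-compressed bytes (assumed definition).

    A higher ratio means the texts compress well, i.e. they are more redundant.
    """
    raw = "\n".join(texts).encode("utf-8")
    if not raw:
        return 1.0  # convention for an empty selection
    return len(raw) / len(zlib.compress(raw, 9))


def greedy_select(pool, budget):
    """Single-stage greedy selection (a simplification, not the paper's ZIP).

    At each step, add the candidate that keeps the compression ratio of the
    selected set as low as possible.
    """
    selected = []
    remaining = list(pool)
    while remaining and len(selected) < budget:
        # Score each candidate by the ratio of the set it would produce.
        best = min(remaining, key=lambda s: compression_ratio(selected + [s]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Calling `greedy_select(pool, budget)` on a pool that contains near-duplicates should favor samples that add new content, since a duplicate raises the ratio of the selected set. Each step recompresses the whole candidate set, so this loop is only practical for small pools; a staged design that narrows the candidates before the expensive step is one plausible way to reduce that cost.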

Paper Structure (23 sections, 7 equations, 7 figures, 2 tables, 1 algorithm)

Figures (7)

  • Figure 1: An illustrative example describing different data selection paradigms. Quality-based data selection relies on sample-level data quality measurements while overlooking combinatorial effects among samples. Information-amount-based selection aims to select samples maximizing the overall information amount.
  • Figure 2: Distribution of the average number of tokens across datasets selected by different algorithms for Mistral-7B.
  • Figure 3: Entropy law demonstration for Mistral-7B. The entropy law curve is fitted to the results of different methods.
  • Figure 4: Entropy law curve for Llama-3-8B. The entropy law curve is fitted to the results of different methods.
  • Figure 5: Practical application of the entropy law to incremental training data updates, where $x_1,x_2,x_3,x_4,x_5$ are five data versions.
  • ...and 2 more figures