Information Capacity: Evaluating the Efficiency of Large Language Models via Text Compression

Cheng Yuan, Jiawei Shao, Chi Zhang, Xuelong Li

TL;DR

This work presents information capacity as a unified metric for evaluating LLM efficiency by linking text compression performance to inference cost. It formalizes IC with a bias-adjusted ratio that incorporates both symbol-length reductions from entropy coding and the computational cost captured by $\log N_M$, while explicitly accounting for tokenizer efficiency. Through extensive evaluation on 52 models spanning five heterogeneous datasets, the authors show that IC is consistent within model series and that mixture-of-experts architectures often yield the highest IC, with tokenizer efficiency and pretraining data quality significantly influencing results. A key contribution is a single-reference performance prediction method based on IC, which outperforms traditional power-law scaling and offers a practical tool for estimating NLL across scales, informing model selection and deployment in edge-cloud settings.

Abstract

Recent years have witnessed the rapid advancements of large language models (LLMs) and their expanding applications, leading to soaring demands for computational resources. The widespread adoption of test-time scaling further aggravates the tension between model capability and resource consumption, highlighting the importance of inference efficiency. However, a unified metric that accurately reflects an LLM's efficiency across different model sizes and architectures remains absent. Motivated by the correlation between compression and intelligence, we introduce information capacity, a measure of model efficiency based on text compression performance relative to computational complexity. Larger models can predict the next token more accurately, achieving greater compression gains but at higher computational costs. Empirical evaluations on mainstream open-source models show that models of varying sizes within a series exhibit consistent information capacity. This metric enables a fair efficiency comparison across model series and accurate performance prediction within a model series. A distinctive feature of information capacity is that it incorporates tokenizer efficiency, which affects both input and output token counts but is often neglected in LLM evaluations. We assess the information capacity of 52 models on 5 heterogeneous datasets and observe consistent results on the influences of tokenizer efficiency, pretraining data, and the mixture-of-experts architecture.
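The link between next-token prediction and compression described in the abstract can be made concrete with a minimal sketch. It assumes an ideal entropy coder (for example, arithmetic coding), whose output length is approximately the sum of $-\log_2 p$ over the probabilities the model assigned to the tokens that actually occurred. The function names and the toy probabilities are hypothetical. This is not the paper's information capacity formula, which additionally relates the compression gain to computational complexity.

```python
import math

def code_length_bits(token_probs):
    """Ideal entropy-coded length in bits: sum of -log2 p over the true tokens."""
    return sum(-math.log2(p) for p in token_probs)

def compression_stats(token_probs, text_num_bytes):
    """Summarize compression of a text of `text_num_bytes` raw bytes.

    token_probs: probability the model gave to each actual next token
                 (hypothetical input; one entry per token).
    """
    bits = code_length_bits(token_probs)
    return {
        "compressed_bits": bits,
        # Lower is better: bits needed per raw byte of text.
        "bits_per_byte": bits / text_num_bytes,
        # Tokenizer efficiency: average raw text size covered by one token.
        "bytes_per_token": text_num_bytes / len(token_probs),
        # Raw size (8 bits per byte) divided by compressed size.
        "compression_ratio": (8 * text_num_bytes) / bits,
    }

# Toy example: 4 tokens covering 12 bytes of text.
stats = compression_stats([0.5, 0.25, 0.25, 0.5], text_num_bytes=12)
print(stats)
```

Because the statistics are normalized by raw bytes rather than by token count, a tokenizer that covers more text per token is credited for it, which is one way to see why tokenizer efficiency enters the comparison.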

Paper Structure

This paper contains 19 sections, 7 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Information capacity of mainstream open-source models. Motivated by the strong correlation between compression and intelligence, information capacity evaluates an LLM's efficiency by text compression performance relative to its computational complexity. Larger models can predict the next token more accurately, leading to higher compression gains but at increased computational costs. Consequently, a series of models with varying sizes exhibits consistent information capacity, which can be used to compare model capability across model series and predict model performance within a series.
  • Figure 2: Information capacity evaluated on mixed text without numerator bias. The information capacity calculated from Eq. (\ref{eq:IC_compute}) decreases almost linearly as the inference FLOPs increase, so at least two models must be trained to predict the performance of a different-sized model. It is also inconvenient to compare model capabilities across different model series.
  • Figure 3: Impact of tokenizer efficiency on information capacity. The information capacity scales almost linearly with the average text size per token across multiple datasets, with Pearson correlation coefficients consistently exceeding 0.98.
  • Figure 4: Impact of post-training on information capacity. The post-training of modern LLMs impairs the model's capability in predicting the next token for plain text, degrading the text compression efficiency and the information capacity. The latest LLMs utilize sophisticated post-training methods, which cause more severe degradations in compression performance.
  • Figure 5: Impact of softmax temperature on the cumulative probability of NLL. A low temperature concentrates estimated probabilities on high-valued logits, which reduces the NLL when the top prediction on the next token is correct but increases NLL penalties for errors. Consequently, a balanced temperature value minimizes the overall NLL, thereby maximizing the information capacity.
  • ...and 2 more figures
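The temperature effect described in the Figure 5 caption can be illustrated with a small sketch. It assumes the standard temperature-scaled softmax, where logits are divided by a temperature before normalization. The logits, targets, and temperature grid are toy values chosen for illustration, not data from the paper.

```python
import math

def nll_with_temperature(logits, target, temperature):
    """NLL (in nats) of `target` under softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_norm - scaled[target]

def mean_nll(examples, temperature):
    """Average NLL over (logits, target) pairs at a given temperature."""
    total = sum(nll_with_temperature(l, t, temperature) for l, t in examples)
    return total / len(examples)

logits = [2.0, 1.0, 0.0]  # the model's top prediction is index 0

# Top prediction correct: lowering the temperature lowers the NLL.
correct_t1 = nll_with_temperature(logits, 0, 1.0)
correct_low = nll_with_temperature(logits, 0, 0.5)

# Top prediction wrong: lowering the temperature raises the NLL.
wrong_t1 = nll_with_temperature(logits, 1, 1.0)
wrong_low = nll_with_temperature(logits, 1, 0.5)

# Mixed set (two correct, one wrong): an intermediate temperature wins.
examples = [(logits, 0), (logits, 0), (logits, 1)]
grid = [0.25, 0.5, 0.75, 1.0, 1.5, 2.0]
best_t = min(grid, key=lambda t: mean_nll(examples, t))
print(correct_t1, correct_low, wrong_t1, wrong_low, best_t)
```

On this toy set the grid search settles on a temperature strictly inside the grid, matching the caption's point that a balanced temperature minimizes the overall NLL.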