Evaluating Large Language Models for Generalization and Robustness via Data Compression

Yucheng Li; Yunhao Guo; Frank Guerin; Chenghua Lin

TL;DR

This work introduces a lossless data compression framework to evaluate large language models, addressing data contamination and prompt sensitivity by measuring generalization on data that appears after a model's training cutoff. By computing $P(X)$ with a language model and applying arithmetic coding, the approach yields compressed lengths that reflect the model's predictive generalization across time, tested on diverse sources (Wikipedia, news, code, arXiv, images, audio) and a broad set of 14 models. Key findings show that compression performance diverges after the training cutoff, with notable variation across data domains; Mistral and Llama-2 strike a favorable balance between generalization and robustness, while context size and tokenization significantly affect results. The method correlates with established benchmarks like HumanEval and MMLU, offering a scalable, contamination-resistant metric that informs model evaluation and future research into domain-generalization and multi-modal compression.
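As a rough illustration of how a compressed length follows from $P(X)$, the sketch below uses the fact that arithmetic coding approaches the ideal code length of $-\log_2 p$ bits per token. The function names, the per-token probabilities, and the exact rate definition (compressed bytes over original bytes, as a percentage) are assumptions for illustration, not the authors' implementation.

```python
import math


def ideal_code_length_bits(token_probs):
    """Ideal code length in bits: sum of -log2 p over all tokens.

    An arithmetic coder driven by the same probabilities gets within a
    small constant number of bits of this value for the whole sequence.
    """
    return sum(-math.log2(p) for p in token_probs)


def compression_rate(token_probs, original_num_bytes):
    """Compressed size as a percentage of the original size (lower is better).

    Assumed definition: 100 * compressed_bytes / original_bytes.
    """
    compressed_bytes = ideal_code_length_bits(token_probs) / 8
    return 100.0 * compressed_bytes / original_num_bytes


# Hypothetical per-token probabilities that a language model assigned to
# the tokens that actually occurred in a 4-byte text.
probs = [0.5, 0.25, 0.25, 0.5]
rate = compression_rate(probs, original_num_bytes=4)
```

In practice the probabilities would come from a model's next-token distribution over the evaluated text; the better the model predicts the text, the fewer bits are needed.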

Abstract

Existing methods for evaluating large language models face challenges such as data contamination, sensitivity to prompts, and the high cost of benchmark creation. To address this, we propose a lossless data compression-based evaluation approach that tests how models' predictive abilities generalize after their training cutoff. Specifically, we collect comprehensive test data spanning 83 months from 2017 to 2023 and split the data into training and testing periods according to each model's training data cutoff. We measure: 1) the compression performance on the testing period as a measure of generalization on unseen data; and 2) the performance gap between the training and testing periods as a measure of robustness. Our experiments test 14 representative large language models of various sizes on sources including Wikipedia, news articles, code, arXiv papers, and multi-modal data. We find that the compression performance of many models degrades significantly after their cutoff date, but models such as Mistral and Llama-2 demonstrate a good balance between performance and robustness. Results also suggest that models struggle to generalize on news and code data, but work especially well on arXiv papers. We also find that the context size and tokenization implementation have a big impact on the overall compression performance.
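The two measurements described above can be sketched as follows, assuming compression rates have already been computed per month. The month keys, the `cutoff` format, the choice to count the cutoff month as part of the training period, and the use of a plain mean are all assumptions made for illustration; the paper's exact aggregation may differ.

```python
from statistics import mean


def split_by_cutoff(monthly_rates, cutoff):
    """Split {'YYYY-MM': rate} into (training-period, testing-period) rates.

    'YYYY-MM' strings sort chronologically, so plain string comparison works.
    The cutoff month itself is counted as training period (an assumption).
    """
    train = [r for month, r in monthly_rates.items() if month <= cutoff]
    test = [r for month, r in monthly_rates.items() if month > cutoff]
    return train, test


def generalization_and_robustness(monthly_rates, cutoff):
    """Return (test-period rate, gap between test and training periods).

    Lower test-period rate suggests better generalization; a smaller gap
    suggests better robustness.
    """
    train, test = split_by_cutoff(monthly_rates, cutoff)
    if not train or not test:
        raise ValueError("need data on both sides of the cutoff")
    test_rate = mean(test)
    gap = test_rate - mean(train)
    return test_rate, gap


# Hypothetical monthly compression rates (%) for one model.
rates = {"2022-11": 10.0, "2022-12": 10.0, "2023-01": 12.0, "2023-02": 14.0}
test_rate, gap = generalization_and_robustness(rates, cutoff="2022-12")
```

Here the hypothetical model compresses post-cutoff text less well than pre-cutoff text, so the gap is positive.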

Paper Structure (21 sections, 3 equations, 5 figures, 6 tables, 2 algorithms)

Introduction
Background
Language Models Evaluation
Compression and Language Models
Our Method
Experiment
Data Collection
Models and Metrics
Results
Generalization and Training Cutoff
Model Comparison
Comparison to Established Benchmarks
Generalization on Different Data Sources
Context Size and Performance
Tokenization and Performance
...and 6 more sections

Figures (5)

Figure 1: The correlation between model compression rate (%, lower is better) and their cutoff date. The cutoff for LLaMA and Llama-2 is 2020 (estimated) and September 2022, respectively (see the Models and Metrics section for details).
Figure 2: (a) Compression rate on Wikitext; (b) Robustness (gap between $\mathrm{Rate_{23}}$ and $\mathrm{Rate_{17-22}}$) and performance ($\mathrm{Rate_{23}}$), tested on Wikitext. InternLM and CodeLlama are excluded from (a) for the sake of figure readability.
Figure 3: The differences in monthly Wikipedia data.
Figure 4: Compression rates change over time on different domains.
Figure 5: The relation between performance and robustness on GitHub code (left) and BBC news (right).
