Llamazip: Leveraging LLaMA for Lossless Text Compression and Training Dataset Detection

Sören Dréano, Derek Molloy, Noel Murphy

TL;DR

Llamazip addresses two problems: efficient lossless text compression and detection of whether a text was part of a large language model's training data. It uses LLaMA3's next-token predictions so that only mispredicted tokens need to be stored; quantization and context window size are tunable settings that affect compression efficiency. The study reports better compression ratios than several baselines across diverse English texts, analyzes how quantization and window size shape performance, and presents evidence that compression behavior may indicate training-data membership. The approach offers practical benefits for archival storage and a new tool for data provenance and transparency in LLM training, while acknowledging limitations related to dataset transparency, language scope, and potential contamination of post-release text.

Abstract

This work introduces Llamazip, a novel lossless text compression algorithm based on the predictive capabilities of the LLaMA3 language model. Llamazip achieves significant data reduction by storing only the tokens that the model fails to predict, which reduces storage without compromising data integrity. Key factors affecting its performance, including quantization and context window size, are analyzed to show their impact on compression ratios and computational requirements. Beyond compression, Llamazip demonstrates the potential to identify whether a document was part of the training dataset of a language model. This capability addresses critical concerns about data provenance, intellectual property, and transparency in language model training.
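
The idea of storing only mispredicted tokens can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the bigram table is a toy stand-in for LLaMA3, and the function names (make_bigram_predictor, compress, decompress), the (position, token) storage layout, and the window parameter are hypothetical and not taken from the paper. The sketch only shows why a deterministic predictor shared by both sides makes the scheme lossless.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Token = str
Predictor = Callable[[Sequence[Token]], Token]


def make_bigram_predictor(corpus: Sequence[Token]) -> Predictor:
    """Toy stand-in for a language model (assumption, not LLaMA3):
    predicts the follower most recently seen after the last context token."""
    table: Dict[Token, Token] = {}
    for prev, nxt in zip(corpus, corpus[1:]):
        table[prev] = nxt  # the last observed follower wins

    def predict(context: Sequence[Token]) -> Token:
        if not context:
            return ""  # nothing to condition on
        return table.get(context[-1], "")

    return predict


def compress(tokens: Sequence[Token], predict: Predictor,
             window: int) -> Tuple[int, List[Tuple[int, Token]]]:
    """Keep only the tokens the predictor gets wrong, with their positions."""
    misses: List[Tuple[int, Token]] = []
    for i, tok in enumerate(tokens):
        context = tokens[max(0, i - window):i]  # limited context window
        if predict(context) != tok:
            misses.append((i, tok))
    return len(tokens), misses


def decompress(length: int, misses: List[Tuple[int, Token]],
               predict: Predictor, window: int) -> List[Token]:
    """Rebuild the text: use a stored token where one exists,
    otherwise trust the same deterministic predictor."""
    stored = dict(misses)
    out: List[Token] = []
    for i in range(length):
        if i in stored:
            out.append(stored[i])
        else:
            out.append(predict(out[max(0, i - window):i]))
    return out


text = "the cat sat on the mat and the cat sat on the mat".split()
predictor = make_bigram_predictor(text)
length, misses = compress(text, predictor, window=4)
restored = decompress(length, misses, predictor, window=4)
print(len(misses), "of", length, "tokens stored; lossless:", restored == text)
```

The round trip works only because compressor and decompressor run the same predictor with the same window over the same already-reconstructed prefix. In this toy, a text that the predictor has already seen needs few stored tokens, which gives an intuition for why compression behavior could hint at training-data membership.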

Paper Structure

This paper contains 32 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Compression of alice29 depending on the context window and the quantization method
  • Figure 2: Compression of Frankenstein depending on the context window and the quantization method
  • Figure 3: Compression of FSDFSD depending on the context window and the quantization method
  • Figure 4: Compression of the LLaMAgen file depending on the context window and the quantization method
  • Figure 5: Compression of the Mistralgen file depending on the context window and the quantization method
  • ...and 2 more figures