LLMComp: A Language Modeling Paradigm for Error-Bounded Scientific Data Compression (Technical Report)

Guozhong Li, Muhannad Alhumaidi, Spiros Skiadopoulos, Panos Kalnis

TL;DR

This work addresses the challenge of efficiently compressing large-scale, high-resolution scientific data with strict fidelity guarantees. It introduces LLMComp, a decoder-only transformer-based framework that tokenizes 3D spatiotemporal fields via Z-order flattening and Lloyd-Max quantization, augments tokens with spatiotemporal coordinates, and trains the model to predict the next token autoregressively. Compression relies on a top-$k$ prediction scheme with fallback corrections that keep every reconstructed value within the error bound, and training uses coverage-guided sampling to improve efficiency. Across the ERA5 and RedSea datasets, LLMComp achieves up to 30% higher compression ratios than state-of-the-art compressors under tight error bounds, demonstrating the potential of decoder-only LLMs as general-purpose, error-bounded data compressors for scientific applications.

Abstract

The rapid growth of high-resolution scientific simulations and observation systems is generating massive spatiotemporal datasets, making efficient, error-bounded compression increasingly important. Meanwhile, decoder-only large language models (LLMs) have demonstrated remarkable capabilities in modeling complex sequential data. In this paper, we propose LLMComp, a novel lossy compression paradigm that leverages decoder-only LLMs to model scientific data. LLMComp first quantizes 3D fields into discrete tokens, arranges them via Z-order curves to preserve locality, and applies coverage-guided sampling to enhance training efficiency. An autoregressive transformer is then trained with spatiotemporal embeddings to model token transitions. During compression, the model performs top-$k$ prediction, storing only rank indices and fallback corrections to ensure strict error bounds. Experiments on multiple reanalysis datasets show that LLMComp consistently outperforms state-of-the-art compressors, achieving up to 30% higher compression ratios under strict error bounds. These results highlight the potential of LLMs as general-purpose compressors for high-fidelity scientific data.
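The tokenization step described above (quantize a 3D field, then order the tokens along a Z-order curve) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `(t, y, x)` axis order, the quantile initialization, and the iteration count are all assumptions made for illustration, and this plain Lloyd-Max quantizer does not by itself enforce an error bound.

```python
import numpy as np

def morton3d(t, y, x, bits=10):
    """Interleave the low `bits` bits of (t, y, x) into one Z-order code."""
    t, y, x = (np.asarray(a, dtype=np.int64) for a in (t, y, x))
    code = np.zeros(t.shape, dtype=np.int64)
    for b in range(bits):
        code |= ((x >> b) & 1) << (3 * b)
        code |= ((y >> b) & 1) << (3 * b + 1)
        code |= ((t >> b) & 1) << (3 * b + 2)
    return code

def zorder_flatten(field):
    """Return the field values in Z-order plus the permutation that was applied."""
    T, H, W = field.shape
    t, y, x = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    codes = morton3d(t.ravel(), y.ravel(), x.ravel())
    order = np.argsort(codes, kind="stable")
    return field.ravel()[order], order

def lloyd_max(values, n_levels, n_iter=20):
    """1D Lloyd-Max: alternate nearest-level assignment and centroid updates."""
    values = np.asarray(values, dtype=np.float64)
    # Assumed initialization: evenly spaced quantiles of the data.
    levels = np.quantile(values, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(n_iter):
        edges = (levels[:-1] + levels[1:]) / 2   # decision boundaries
        tokens = np.searchsorted(edges, values)  # nearest-level index
        for j in range(n_levels):
            members = values[tokens == j]
            if members.size:                     # keep old level if cell is empty
                levels[j] = members.mean()
    edges = (levels[:-1] + levels[1:]) / 2
    return levels, np.searchsorted(edges, values)
```

A typical use would call `zorder_flatten(field)` first and then `lloyd_max(flat, n_levels)` to obtain the token sequence; keeping `order` allows the decoder to scatter reconstructed values back into the 3D volume.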

Paper Structure

This paper contains 22 sections, 5 equations, 12 figures, 4 tables, 1 algorithm.

Figures (12)

  • Figure 1: The ERA5 dataset: a visualization of ERA5 reanalysis temperature data (Hersbach et al., 2020).
  • Figure 2: The workflow of LLMComp
  • Figure 3: Distribution of original tokens and of target-aware sampling on the ERA5 dataset. (a) The original token distribution is highly skewed, with dense peaks and underrepresented regions, leading to poor coverage under random or uniform sampling. (b) Our target-aware sampling redistributes the frequency more uniformly, ensuring rare but important tokens are adequately learned during training.
  • Figure 4: Autoregressive decompression workflow. Given an initial input, the LLM predicts the top-$k$ tokens. The next token is reconstructed from its stored rank index if the ground truth is within the top-$k$, and from a fallback correction otherwise. The updated sequence is used for the next step. Recovered tokens are finally mapped to a 3D temperature volume.
  • Figure 5: Decompression quality (PSNR -- larger is better) vs. efficiency (bit rate -- lower is better).
  • ...and 7 more figures
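
The top-$k$ workflow summarized in the Figure 4 caption can be sketched as a round trip over a token sequence. This is a minimal illustration under stated assumptions, not the paper's codec: `toy_predictor` is a hypothetical stand-in for the LLM, and the use of the value `k` as a fallback sentinel with a separate fallback stream is an assumed encoding. The one property the sketch does rely on is that the predictor is deterministic, so the decoder regenerates exactly the candidate lists the encoder saw.

```python
from typing import Callable, List, Sequence, Tuple

# A predictor maps the tokens seen so far to a ranked candidate list.
Predictor = Callable[[Sequence[int]], List[int]]

def toy_predictor(vocab_size: int, k: int) -> Predictor:
    """Hypothetical stand-in for the LLM: rank tokens by closeness to the last one."""
    def predict(context: Sequence[int]) -> List[int]:
        last = context[-1]
        ranked = sorted(range(vocab_size), key=lambda tok: (abs(tok - last), tok))
        return ranked[:k]
    return predict

def compress(tokens: Sequence[int], predict: Predictor, k: int) -> Tuple[int, List[int], List[int]]:
    """Store a rank per token; on a top-k miss store the sentinel k and the raw token."""
    ranks: List[int] = []
    fallbacks: List[int] = []
    context = [tokens[0]]
    for tok in tokens[1:]:
        candidates = predict(context)
        if tok in candidates:
            ranks.append(candidates.index(tok))
        else:
            ranks.append(k)        # assumed sentinel: "not in top-k"
            fallbacks.append(tok)  # fallback correction
        context.append(tok)
    return tokens[0], ranks, fallbacks

def decompress(first: int, ranks: Sequence[int], fallbacks: Sequence[int],
               predict: Predictor, k: int) -> List[int]:
    """Replay the predictor and resolve each rank or fallback in order."""
    context = [first]
    fallback_iter = iter(fallbacks)
    for rank in ranks:
        candidates = predict(context)
        context.append(candidates[rank] if rank < k else next(fallback_iter))
    return context
```

When the predictor is accurate, most stored ranks are small and repetitive, which is what makes the rank stream cheaper to entropy-code than the raw tokens; the fallback stream only grows when the true token falls outside the top-$k$.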