Pcodec: Better Compression for Numerical Sequences

Martin Loncaric, Niels Jeppesen, Ben Zinberg

TL;DR

Pco losslessly compresses numerical sequences by representing inputs as latent unsigned integers that are smoothly, independently, and identically distributed (SIID), applying mode and delta preprocessing followed by a binning-based entropy coder. The paper proves a bound showing that binning converges to the true entropy: $\hat{H} \le H + \frac{3 s \log_2(T)}{k-2s}\frac{T}{T-1}$, where $\hat{H}$ is the expected binned bit cost, $H$ the base-2 Shannon entropy, $T$ the domain size, $s$ the number of monotonic mixture components, and $k$ the maximum number of bins. The system supports automatic selection of mode and delta encoding, a DP-based bin optimization, and a flexible wrapping format. It delivers a 29–94% higher compression ratio than competing numerical codecs on six real-world datasets while maintaining fast decompression (often >1 GiB/s per thread). These gains translate to substantial storage savings for large columnar and time-series data, and Pco is already embedded in projects like Zarr and CnosDB; future work includes sharper comparisons to idealized LZ and broader deployment.

Abstract

We present Pcodec (Pco), a format and algorithm for losslessly compressing numerical (float or integer) sequences. Pco's core and most novel component is a binning algorithm that quickly converges to the true entropy of smoothly, independently, and identically distributed (SIID) integers. We mathematically prove this convergence with a practical bound. To accommodate data that is not SIID, Pco has two opinionated preprocessing steps. The first step, Pco's mode, decomposes the numbers into more smoothly distributed integer latent variables. The second step, delta encoding, makes the latents more independently and identically distributed. We demonstrate that Pco achieves a 29–94% higher compression ratio than other numerical codecs on six real-world columnar datasets while using less compression time.
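
To make the delta-encoding step concrete, the following is a minimal sketch of first-order delta encoding on a trending integer sequence. It is not Pco's implementation; the function names `delta_encode` and `delta_decode` and the sample data are hypothetical and only illustrate why differences of consecutive values tend to be closer to identically distributed than the raw values.

```python
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """First-order delta: keep the first value, then consecutive differences."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode with a running sum."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


# Hypothetical timestamps: the raw values trend upward, so they are far from
# identically distributed, but the differences cluster tightly around 10.
timestamps = [1000, 1010, 1021, 1030, 1041, 1050]
deltas = delta_encode(timestamps)
assert delta_decode(deltas) == timestamps  # the transform is lossless
print(deltas)  # [1000, 10, 11, 9, 11, 9]
```

After the first element, the deltas occupy a narrow range, which is the kind of input a binning-based entropy coder can represent with few bits.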

Paper Structure

This paper contains 23 sections, 3 theorems, 50 equations, 4 figures, 4 tables.

Key Result

Theorem 1

Suppose $X$ is a mixture over domain $\{0, \ldots, T - 1\}$ of $s$ disjoint integer distributions, each of which has a monotonic PMF. Then for any $k>2s$, there exists a binning of at most $k$ bins such that the expected binned bit cost $\hat{H}$ of a random draw from $X$ satisfies

$$\hat{H} \le H + \frac{3 s \log_2(T)}{k-2s}\frac{T}{T-1},$$

where $H$ is the base-2 Shannon entropy of $X$.
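
As a rough illustration of the quantities in Theorem 1, the sketch below evaluates the bound from the TL;DR, $H + \frac{3 s \log_2(T)}{k-2s}\frac{T}{T-1}$, and compares it with the cost of one hand-picked binning. The cost model is a simplifying assumption that is not stated in the text above: a draw costs $-\log_2$ of its bin's weight to identify the bin, plus $\log_2$ of the bin's width to locate the value inside it. The function names, the geometric-like PMF, and the power-of-two bins are all hypothetical. The theorem only guarantees that some binning with at most $k$ bins meets the bound, not this particular one.

```python
import math
from typing import List, Tuple


def entropy_bits(pmf: List[float]) -> float:
    """Base-2 Shannon entropy H of a PMF."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def binned_cost_bits(pmf: List[float], bins: List[Tuple[int, int]]) -> float:
    """Expected bits per draw under the assumed cost model.

    Each bin is an inclusive (lo, hi) range. A draw landing in a bin costs
    -log2(bin weight) for the bin plus log2(bin width) for the offset.
    """
    cost = 0.0
    for lo, hi in bins:
        weight = sum(pmf[lo:hi + 1])
        if weight > 0:
            cost += weight * (-math.log2(weight) + math.log2(hi - lo + 1))
    return cost


def theorem_bound_bits(h: float, s: int, T: int, k: int) -> float:
    """Right-hand side of the Theorem 1 bound; requires k > 2s."""
    if k <= 2 * s:
        raise ValueError("the bound requires k > 2s")
    return h + 3 * s * math.log2(T) / (k - 2 * s) * T / (T - 1)


# Hypothetical example: one monotonically decreasing PMF (s = 1) over T = 256.
T = 256
raw = [0.97 ** x for x in range(T)]
total = sum(raw)
pmf = [r / total for r in raw]

# Hand-picked bins with power-of-two widths: [0,0], [1,1], [2,3], ..., [128,255].
bins = [(0, 0)] + [(2 ** i, 2 ** (i + 1) - 1) for i in range(8)]

h = entropy_bits(pmf)
cost = binned_cost_bits(pmf, bins)
bound = theorem_bound_bits(h, s=1, T=T, k=len(bins))
print(f"H = {h:.3f} bits, binned cost = {cost:.3f} bits, bound = {bound:.3f} bits")
```

Under this cost model the binned cost is a cross-entropy against a piecewise-uniform distribution, so it can never fall below $H$; the overhead term in the bound shrinks roughly as $1/k$ as more bins are allowed.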

Figures (4)

  • Figure 1: The processing steps involved in Pco compression and decompression. By default, compression automatically chooses the mode and delta encoding. Compression must be done one coarse-grained chunk at a time, but can be written out in fine-grained pages. Decompression can be done in even finer-grained batches.
  • Figure 2: Plate notation for the chunk metadata and page components. The wrapping format decides where to place each header, chunk metadata, and page.
  • Figure 3: Empirical Pco compressed size on a synthetic SIID dataset compared with theoretical bounds: the lower bound of the distribution's true entropy and the upper bound from Theorem 1. One million draws from a Lomax distribution over 64-bit integers were used. Note that our upper bound makes some simplifying approximations and does not account for metadata, but these inaccuracies are very small in practice.
  • Figure 4: Compression characteristics in single- and multi-threaded environments for all codecs on all datasets. In every case, Pco is the Pareto front for its range of compression speeds.

Theorems & Definitions (3)

  • Theorem 1
  • Lemma 1
  • Lemma 2