Pcodec: Better Compression for Numerical Sequences
Martin Loncaric, Niels Jeppesen, Ben Zinberg
TL;DR
Pco losslessly compresses numerical sequences by representing inputs as latent SIID (smoothly, independently, and identically distributed) unsigned integers, applying mode and delta preprocessing followed by a binning-based entropy coder. The paper proves a bound showing that binning converges to the true entropy: $\hat{H} \le H + \frac{3 s \log_2(T)}{k-2s}\frac{T}{T-1}$. The system supports automatic selection of mode and delta encoding, a DP-based bin optimization, and a flexible wrapping format. It achieves a 29–94% higher compression ratio than competing numerical codecs on six real-world datasets while maintaining fast decompression (often >1 GiB/s per thread). These gains translate to substantial storage savings for large columnar and time-series data, and Pco is already embedded in projects such as Zarr and CnosDB. Future work includes sharper comparisons to idealized LZ and broader deployment.
Abstract
We present Pcodec (Pco), a format and algorithm for losslessly compressing numerical (float or integer) sequences. Pco's core and most novel component is a binning algorithm that quickly converges to the true entropy of smoothly, independently, and identically distributed (SIID) integers. We mathematically prove this convergence with a practical bound. To accommodate data that is not SIID, Pco has two opinionated preprocessing steps. The first step, Pco's mode, decomposes the numbers into more smoothly distributed integer latent variables. The second step, delta encoding, makes the latents more independently and identically distributed. We demonstrate that Pco achieves 29-94% higher compression ratio than other numerical codecs on six real-world columnar datasets while using less compression time.
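To illustrate why delta encoding can help an entropy coder, the following minimal Python sketch applies first-order differencing to a synthetic, trending integer sequence and compares empirical entropies before and after. It is not Pco's implementation: the function names (`delta_encode`, `delta_decode`, `empirical_entropy_bits`) and the synthetic `timestamps` data are assumptions made purely for illustration, and the entropy estimate is a plain per-symbol frequency count rather than Pco's binning.

```python
import math
from collections import Counter


def delta_encode(xs):
    """Return the first value followed by consecutive differences."""
    if not xs:
        return []
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]


def delta_decode(ds):
    """Invert delta_encode with a running sum."""
    out = []
    total = 0
    for d in ds:
        total += d
        out.append(total)
    return out


def empirical_entropy_bits(xs):
    """Empirical Shannon entropy (bits per value) of the observed symbols."""
    n = len(xs)
    counts = Counter(xs)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Hypothetical data: a steadily increasing sequence with a small periodic
# wobble, loosely resembling timestamps. Every raw value is distinct.
timestamps = [1000 + 10 * i + (i % 3) for i in range(300)]
deltas = delta_encode(timestamps)

raw_bits = empirical_entropy_bits(timestamps)
delta_bits = empirical_entropy_bits(deltas)
print(f"raw: {raw_bits:.2f} bits/value, deltas: {delta_bits:.2f} bits/value")
```

In this toy case the raw values never repeat, so their empirical entropy is as high as it can be for the sample size, while the deltas take only a few distinct values and are much cheaper to code. Real data will not be this clean, which is presumably why the choice of delta encoding is made automatically rather than applied unconditionally.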
