Empirical Lossless Compression Bound of a Data Sequence

Lei M Li

TL;DR

This work derives an empirical lossless compression bound for individual data sequences by applying normalized maximum likelihood (NML) coding within exponential-family models and connecting it to predictive and Bayesian mixture codes. Using local asymptotic normality, the author shows that the NML code length equals $nH(\hat{\theta}_n)+\frac{d}{2}\log\frac{n}{2\pi}+\log\int_{\Theta}|I(\theta)|^{1/2}d\theta+o(1)$, while the Bayesian mixture code has the same first two terms but replaces the integral term with $\log\frac{|I(\hat{\theta}_n)|^{1/2}}{w(\hat{\theta}_n)}+o(1)$, which shows how the prior and the Fisher information enter the bound beyond the plug-in entropy. The results cover discrete multinomial and continuous cases. For protein-encoding DNA sequences, parsing in phase with the codons typically gives the highest compression, whereas pseudo-random sequences remain incompressible under these model-based bounds. The framework combines Shannon-style coding, MDL model selection, and LAN-based asymptotics to yield sequence-specific bounds that are more accurate than $nH(\hat{\theta}_n)$, particularly when the dictionary is large. In practice, this offers a principled way to compare parsing schemes and priors when estimating the lossless compression limit of real data, including biological sequences.

Abstract

We consider the lossless compression bound of any individual data sequence. If we fit the data by a parametric model, the entropy quantity $nH({\hat θ}_n)$ obtained by plugging in the maximum likelihood estimate underestimates the bound, where $n$ is the number of words. Shtarkov showed that the normalized maximum likelihood (NML) distribution, or equivalently its code length, is optimal in a minimax sense for any parametric family. We show by local asymptotic normality that the NML code length for exponential families is $nH(\hat θ_n) +\frac{d}{2}\log \, \frac{n}{2π} +\log \int_Θ |I(θ)|^{1/2}\, dθ+o(1)$, where $d$ is the model dimension or dictionary size, and $|I(θ)|$ is the determinant of the Fisher information matrix. We also demonstrate that sequentially predicting the optimal code length for the next word via a Bayesian mechanism leads to the mixture code, whose pathwise length is given by $nH({\hat θ}_n) +\frac{d}{2}\log \, \frac{n}{2π} +\log \frac{|\, I({\hat θ}_n)|^{1/2}}{w({\hat θ}_n)}+o(1) $, where $w(θ)$ is a prior. The asymptotics apply not only to discrete symbols but also to continuous data if the code length for the former is replaced by the description length for the latter. The analytical result is exemplified by calculating compression bounds of protein-encoding DNA sequences under different parsing models. Typically, the highest compression is achieved when the parsing is in phase with the amino acid codons. On the other hand, the compression rates of pseudo-random sequences are larger than 1 regardless of the parsing model. These model-based results are consistent with the assertion of Kolmogorov complexity theory that random sequences are incompressible. The empirical lossless compression bound is more accurate than the plug-in entropy, particularly when the dictionary size is relatively large.
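The two code lengths in the abstract can be evaluated numerically for the multinomial (fixed-length parsing) case. The following is a minimal sketch under stated assumptions, not the paper's own implementation: the function names `parse_words`, `nml_bound_bits`, and `mixture_code_bits` are hypothetical, the $o(1)$ term is dropped, the dictionary is taken to be the set of observed words unless `dict_size` is supplied, and the mixture code uses a Dirichlet$(1/2,\dots,1/2)$ (Jeffreys) prior. For a multinomial with $k$ categories, $d=k-1$ and $\int_Θ |I(θ)|^{1/2}\,dθ = π^{k/2}/Γ(k/2)$, a standard result. With the Jeffreys prior, $\log\frac{|I({\hat θ}_n)|^{1/2}}{w({\hat θ}_n)}$ reduces to $\log\int_Θ |I(θ)|^{1/2}\,dθ$, so the two lengths should agree up to $o(1)$.

```python
import math
from collections import Counter


def parse_words(seq, word_len=3, phase=0):
    """Cut seq into non-overlapping words of word_len symbols, starting at offset phase."""
    last_start = len(seq) - word_len
    return [seq[i:i + word_len] for i in range(phase, last_start + 1, word_len)]


def nml_bound_bits(words, dict_size=None):
    """Return (plug-in entropy n*H, asymptotic NML code length) in bits; o(1) is dropped."""
    counts = Counter(words)
    n = len(words)
    # Assumption: the dictionary is the set of observed words unless dict_size is given.
    k = len(counts) if dict_size is None else dict_size
    d = k - 1  # free parameters of a k-category multinomial
    plug_in = -sum(c * math.log2(c / n) for c in counts.values())
    # log of the integral of |I(theta)|^(1/2) over the simplex, in nats:
    # log( pi^(k/2) / Gamma(k/2) )
    log_integral = (k / 2) * math.log(math.pi) - math.lgamma(k / 2)
    penalty = (d / 2) * math.log2(n / (2 * math.pi)) + log_integral / math.log(2)
    return plug_in, plug_in + penalty


def mixture_code_bits(words, dict_size=None):
    """Pathwise length in bits of the sequential Bayesian code, Dirichlet(1/2,...,1/2) prior."""
    k = len(set(words)) if dict_size is None else dict_size
    seen = Counter()
    total = 0.0
    for t, w in enumerate(words):
        # Predictive probability of the next word given the t words seen so far.
        prob = (seen[w] + 0.5) / (t + k / 2)
        total -= math.log2(prob)
        seen[w] += 1
    return total


if __name__ == "__main__":
    # Hypothetical toy sequence, for illustration only (not data from the paper).
    toy = "ATGGCCAAATTTGGCTAA" * 30
    for phase in range(3):
        words = parse_words(toy, word_len=3, phase=phase)
        plug_in, nml = nml_bound_bits(words)
        # Assumption: the raw encoding costs 2 bits per nucleotide.
        rate = nml / (2 * 3 * len(words))
        print(phase, round(plug_in, 1), round(nml, 1), round(rate, 3))
```

Passing `dict_size=64` for codon parsing would charge for all 64 possible triplets rather than only the observed ones; which choice matches the paper's tables is not stated in this summary.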

Paper Structure (29 sections, 5 theorems, 50 equations, 3 tables)


Introduction
A brief review of the key concepts
Data compression
Prefix code
Shannon's probability-based coding
Kolmogorov complexity and algorithm-based coding
Correspondence between probability models and string parsing
Fixed-length and variable-length parsing
Parametric models and complexity
Two references for code length evaluation
Optimality of normalized maximum likelihood code length
Empirical code lengths based on exponential family distributions
Exponential families
Predictive coding
Redundancy
...and 14 more sections

Key Result

Theorem 1

(Empirical optimal source code length) If we fit an individual data sequence by an exponential family distribution, the NML code length is given by $nH(\hat{\theta}_n)+\frac{d}{2}\log\frac{n}{2\pi}+\log\int_{\Theta}|I(\theta)|^{1/2}\,d\theta+o(1)$, where $H(\hat{\theta}_n)$ is the entropy evaluated at the maximum likelihood estimate (MLE) $\hat{\theta}_n=\hat{\theta}(x^{(n)})$, and $|I(\theta)|$ is the determinant of the Fisher information $I(\theta)=\left[-E\left(\frac{\partial^2 \log p(X;\theta)}{\partial\theta_j\,\partial\theta_k}\right)\right]_{j,k}$ ...
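
As a worked instance of the NML formula $nH(\hat{\theta}_n)+\frac{d}{2}\log\frac{n}{2\pi}+\log\int_{\Theta}|I(\theta)|^{1/2}\,d\theta+o(1)$, consider the Bernoulli model with $d=1$ and $\theta\in(0,1)$. This example is an illustration added here and is not taken from the paper's text. The Fisher information is $I(\theta)=1/(\theta(1-\theta))$, so

$$
\int_0^1 |I(\theta)|^{1/2}\,d\theta=\int_0^1 \theta^{-1/2}(1-\theta)^{-1/2}\,d\theta=B\!\left(\tfrac12,\tfrac12\right)=\pi,
$$

$$
nH(\hat{\theta}_n)+\frac{1}{2}\log\frac{n}{2\pi}+\log\pi+o(1)=nH(\hat{\theta}_n)+\frac{1}{2}\log\frac{n\pi}{2}+o(1).
$$

For $n=100$ binary symbols with equal counts, $nH(\hat{\theta}_n)=100$ bits and the correction $\frac{1}{2}\log_2(50\pi)$ is about $3.65$ bits, which is the amount by which the plug-in entropy underestimates the bound in this case.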

Theorems & Definitions (5)

Theorem 1
Proposition 1
Theorem 2
Proposition 2
Proposition 3
