The LZ78 Source

Naomi Sagan, Amir Dembo, Matthew Ho, Tsachy Weissman

TL;DR

The paper introduces the LZ78 source $Q^{\text{LZ}, \Uppi}$, a probability source generated by LZ78 sequential probability assignments with a prior $\Uppi$ on PMFs. It establishes the source's basic entropic properties: the entropy rate is $\mathbb{E}_{\Uppi}[H(\Upsilon)]$, and a Shannon-McMillan-Breiman (SMB)-type limit holds for the normalized log-likelihood. The optimal finite-state log loss, by contrast, equals $H(\mathbb{E}[\Upsilon])$; because entropy is concave, this is at least the entropy rate, and the difference is the Jensen gap. The authors develop an empirical-measure framework showing that the zero-order empirical distribution of the $\mathbf{B}$ process converges to $\Uppi$ and that the $r$-tuple empirical distributions converge to deterministic i.i.d. laws, which enables analysis of non-Markovian, non-stationary data. They also run simulations and transformer-based in-context learning (ICL) experiments to illustrate the theory and to benchmark sequential models against Markov and CTW baselines on non-Markovian data, including genomic sequences. Together, these results make the LZ78 source a principled benchmark for sequential probability models and offer insight into in-context learning under long-range, non-stationary dependencies.
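To make the Jensen gap concrete, the following is a minimal sketch for a binary alphabet, assuming the prior is a Beta distribution (a two-symbol Dirichlet). It estimates the entropy rate $\mathbb{E}[H(\Upsilon)]$ by Monte Carlo and compares it with $H(\mathbb{E}[\Upsilon])$. The function names and the sample count are illustrative choices, not taken from the paper.

```python
import math
import random


def binary_entropy_bits(p):
    """Entropy in bits of a Bernoulli(p) PMF, with 0 log 0 taken as 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def jensen_gap_beta(a, b, num_samples=200_000, seed=0):
    """Return (E[H(theta)], H(E[theta])) for theta ~ Beta(a, b).

    E[H(theta)] is a Monte Carlo estimate (assumption: Beta prior on a
    binary alphabet); H(E[theta]) uses the exact mean a / (a + b).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += binary_entropy_bits(rng.betavariate(a, b))
    mean_entropy = total / num_samples
    entropy_of_mean = binary_entropy_bits(a / (a + b))
    return mean_entropy, entropy_of_mean


if __name__ == "__main__":
    mean_entropy, entropy_of_mean = jensen_gap_beta(0.5, 0.5)
    print(f"E[H(theta)] ~ {mean_entropy:.3f} bits")
    print(f"H(E[theta]) = {entropy_of_mean:.3f} bits")
    print(f"Jensen gap  ~ {entropy_of_mean - mean_entropy:.3f} bits")
```

For the Jeffreys prior Beta(0.5, 0.5), the mean PMF is uniform, so $H(\mathbb{E}[\Upsilon])$ is exactly 1 bit, while the expected entropy has the closed form $2 - 1/\ln 2 \approx 0.557$ bits; the estimate should land close to that value.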

Abstract

We study a family of processes generated according to sequential probability assignments induced by the LZ78 universal compressor. We characterize entropic and distributional properties such as their entropy and relative entropy rates, finite-state compressibility and log loss of their realizations, and the empirical distributions that they induce. Though not quite stationary, these sources are "almost stationary and ergodic": like stationary and ergodic processes, they satisfy a Shannon-McMillan-Breiman-type property, in that the normalized log probability of their realizations converges almost surely to their entropy rate. Further, they are locally "almost i.i.d." in the sense that the finite-dimensional empirical distributions of their realizations converge almost surely to a deterministic i.i.d. law. However, unlike stationary ergodic sources, the finite-state compressibility of their realizations is almost surely strictly larger than their entropy rate by a "Jensen gap". We present simulations demonstrating the theoretical results. These sources make it possible to gauge the performance of sequential probability models, both classical and deep learning-based, on non-Markovian, non-stationary data. As such, we apply realizations of the LZ78 source to the study of in-context learning in transformer models.
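One way to picture how a realization might be drawn is sketched below. It rests on an assumption about the construction that this summary does not spell out: every node of the LZ78 prefix tree carries its own PMF drawn independently from the prior, each symbol is sampled from the PMF at the current node, and the traversal returns to the root whenever a new leaf is added. The Dirichlet($\gamma, \dots, \gamma$) prior, the function name, and all parameters are illustrative.

```python
import random


def sample_lz78_source(n, alphabet_size=2, gamma=0.5, seed=0):
    """Draw n symbols from a sketch of an LZ78-style source.

    Assumption: each tree node holds a PMF drawn i.i.d. from a
    Dirichlet(gamma, ..., gamma) prior. Returns (symbols, num_phrases).
    """
    rng = random.Random(seed)

    def draw_pmf():
        # Dirichlet draw via normalized Gamma variates.
        g = [rng.gammavariate(gamma, 1.0) for _ in range(alphabet_size)]
        total = sum(g)
        if total == 0.0:  # numerical underflow guard
            return [1.0 / alphabet_size] * alphabet_size
        return [x / total for x in g]

    def new_node():
        return {"pmf": draw_pmf(), "children": {}}

    root = new_node()
    node = root
    symbols = []
    num_phrases = 0
    for _ in range(n):
        # Sample the next symbol from the PMF at the current node.
        sym = rng.choices(range(alphabet_size), weights=node["pmf"])[0]
        symbols.append(sym)
        child = node["children"].get(sym)
        if child is None:
            # Unseen branch: add a leaf, close the phrase, restart at root.
            node["children"][sym] = new_node()
            num_phrases += 1
            node = root
        else:
            node = child
    return symbols, num_phrases


if __name__ == "__main__":
    x, phrases = sample_lz78_source(50)
    print("".join(map(str, x)), phrases)
```

Because the PMF in use depends on the whole parsing history rather than on a fixed-length context, sequences produced this way are neither Markov of any finite order nor stationary, which is the regime the abstract describes.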

Paper Structure

This paper contains 49 sections, 26 theorems, 204 equations, and 15 figures.

Key Result

Theorem 3.1

Let $\mathbf{X}$ be generated from the LZ78 source $Q^{\text{LZ}, \Uppi}$ with $\mathop{\mathrm{supp}}\nolimits(\Uppi) = \mathcal{M}(\mathcal{A})$. Then, almost surely,
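The displayed limit is missing from this extract. Judging from the abstract (the normalized log probability converges almost surely to the entropy rate) and the entropy rate quoted in the TL;DR, the conclusion plausibly takes the following form. The notation $X^n$ for the first $n$ symbols of $\mathbf{X}$ is an assumption, and the paper's exact statement may differ.

$$
\lim_{n \to \infty} \frac{1}{n} \log \frac{1}{Q^{\text{LZ}, \Uppi}(X^n)} = \mathbb{E}_{\Uppi}[H(\Upsilon)].
$$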

Figures (15)

  • Figure 1: Entropy Rate of the LZ78 Source for a Dirichlet($\gamma, \dots, \gamma$) prior.
  • Figure 2: Simulation results, where $\Uppi$ is the Dirichlet($0.01, 0.01$) distribution.
  • Figure 3: Simulation results, where $\Uppi$ is the Jeffreys prior (Dirichlet($0.5, 0.5$)).
  • Figure 4: Simulation results, where $\Uppi$ is the Dirichlet($2, 2$) distribution.
  • Figure 5: Simulation results, where $\Uppi$ is the Dirichlet($0.5, 0.5, 0.5$) distribution.
  • ...and 10 more figures

Theorems & Definitions (95)

  • Definition 2.1: Alphabets and Sequences
  • Remark 2.2: Alphabets in Proofs
  • Definition 2.3: Order of Growth
  • Remark 2.4: Default Logarithmic Base
  • Example 2.6: Realization from the LZ78 Source
  • Remark 2.7: Non-Stationarity of the LZ78 Source
  • Theorem 3.1: A Shannon-McMillan-Breiman-Type Result
  • Theorem 3.2: Entropy Rate
  • Definition 3.3: Markov Sequential Probability Assignment Log Loss
  • Theorem 3.4: Optimal Finite-State and Markov Model Log Loss on $\mathbf{X}$
  • ...and 85 more