Parselets: An Abstraction for Fast, General-Purpose Algorithmic Information Calculus

François Cayre

TL;DR

This work describes the principled design of a theoretical framework for fast and accurate algorithmic information measures on finite multisets of finite strings by means of compression. The framework natively embodies Epicurus' Principle on top of Occam's Razor to produce an explicit model of the data that is both most significant and most general.

Abstract

This work describes the principled design of a theoretical framework leading to fast and accurate algorithmic information measures on finite multisets of finite strings by means of compression. One distinctive feature of our approach is that it manipulates reified, explicit representations of the very entities and quantities of the theory itself: compressed strings, models, rate-distortion states, minimal sufficient models, and joint and relative complexity. To do so, a programmable, recursive data structure called a parselet models a string as a concatenation of parameterized instantiations from sets of finite strings that encode the regular part of the data. This supports another distinctive feature of this work, the native embodiment of Epicurus' Principle on top of Occam's Razor, which produces an explicit model of the data that is both most significant and most general. This model is iteratively evolved through the Principle of Minimal Change to reach the so-called minimal sufficient model of the data. Parselets may also be used to compute a compression score for any arbitrary hypothesis about the data. A lossless, rate-distortion-oriented compressed representation is proposed; it allows costly computations stored on disk to be reused immediately, and their fast merging is our core routine for information calculus. Two information measures are deduced: one is exact because it is purely combinatorial, and the other may occasionally incur slight numerical inaccuracies because it approximates the Kolmogorov complexity of the minimal sufficient model. Symmetry of information is enforced at the bit level. Whenever possible, parselets are compared with off-the-shelf compressors on real data. Some other applications are simply made possible by parselets.
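To make the baseline concrete, here is a minimal sketch of how joint and relative complexity are commonly estimated with an off-the-shelf compressor. It is not the parselet method: zlib is an assumed stand-in, and the function names (`c_bits`, `joint_bits`, `conditional_bits`, `symmetry_gap_bits`) are hypothetical names introduced only for this illustration.

```python
import zlib


def c_bits(data: bytes) -> int:
    """Compressed size in bits: a crude, computable stand-in for complexity."""
    return 8 * len(zlib.compress(data, 9))


def joint_bits(x: bytes, y: bytes) -> int:
    """Stand-in for joint complexity: compress x followed by y."""
    return c_bits(x + y)


def conditional_bits(x: bytes, y: bytes) -> int:
    """Stand-in for the complexity of x given y: C(yx) - C(y), clamped at 0."""
    return max(0, c_bits(y + x) - c_bits(y))


def symmetry_gap_bits(x: bytes, y: bytes) -> int:
    """|C(xy) - C(yx)|: how far the joint estimate depends on argument order."""
    return abs(joint_bits(x, y) - joint_bits(y, x))


if __name__ == "__main__":
    # Illustrative inputs only; any two byte strings work.
    x = b"the quick brown fox jumps over the lazy dog " * 20
    y = b"lorem ipsum dolor sit amet consectetur " * 20
    print("C(x)          =", c_bits(x))
    print("C(x | x)      =", conditional_bits(x, x))
    print("C(x | y)      =", conditional_bits(x, y))
    print("symmetry gap  =", symmetry_gap_bits(x, y))
```

With a general-purpose compressor, nothing forces the symmetry gap to be zero, since the compressed size of a concatenation can depend on the order of its parts. This is the kind of discrepancy that the abstract's bit-level symmetry of information is meant to rule out.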

Paper Structure

This paper contains 45 sections, 1 theorem, 22 equations, 18 figures, 7 tables, 16 algorithms.

Key Result

Proposition 1

The equation labeled eq:prob defines a normalized, well-defined, and finitely additive probability measure.

Figures (18)

  • Figure 1: Rate-distortion plots of the mitochondrial DNA sample cat (17009 bases). Lossless model: 188 parselets. Minimal sufficient model: 150 parselets at $r$ = 37.2kb. Last model: 87 parselets. Wall-clock time: 4.8s (full plot).
  • Figure 2: Rate-distortion plots of the english string (10843 letters). Lossless model: 315 parselets. Minimal sufficient model: 259 parselets at $r$ = 36kb. Last model: 80 parselets. Wall-clock time: 4.7s (full plot).
  • Figure 3: Beginning of the algorithmically, optimally denoised english string (decompression of the lossy string associated with the minimal sufficient model of Fig. 2).
  • Figure 4: Excerpt of the minimal sufficient model built for english, in canonical order (206 parselets). Regexp operators appear on the edges (look at the small star on incoming edges to letter 's' below parselets 0x105 and 0x1dd for instance). This is the Hasse diagram of lattice $(\mathcal{D}_{\hat{x}},\preceq)$ (albeit with inverted arrows, and omitting $\top=\underline{x}$ and $\bot=\epsilon$).
  • Figure 5: Applying the algorithm labeled alg:cond:deflate to english with minimal sufficient models from languages. The value of $C(\hbox{english}\Vert\emptyset)$ is reported for reference. Wall-clock time: 0.142s (parallelization: x5.8).
  • ...and 13 more figures

Theorems & Definitions (10)

  • Definition 1: Information measure on string multisets
  • Definition 2: Parselets
  • Definition 3: In/compressible string multisets
  • Definition 4: Shannon information lattice on string multisets
  • Definition 5: Syntactic conditional independence of string multisets
  • Definition 6: Abstract probabilities
  • Proposition 1: $(\Omega, 2^{\mathcal{D}_{\hat{\Omega}}}, \textbf{p})$ is a probability space
  • proof
  • Definition 7: Conditional independence statistics
  • Definition 8: String slot