Information Theory for Expectation Measures
Peter Harremoës
TL;DR
The paper develops an information-divergence framework for expectation measures, enabling analysis of nonrandom data and point processes within a unified coding perspective. It derives explicit optimal coding rules for empirical texts, notably showing $\ell^*(a)=\ln\left(\frac{\|\mu^*\|}{\mu^*(a)}\right)$ with $\mu^*$ maximizing entropy over convex hulls of empirical measures, thereby connecting minimal description length to maximum-entropy principles. A Poisson interpretation then links $D(\mu\|\nu)$ to divergences between Poisson processes $\mathrm{Po}(\mu)$ and $\mathrm{Po}(\nu)$, yielding a chain rule that separates sample-size uncertainty from letter-uncertainty and enabling information projection and reverse information projection analyses. Overall, the work ties Kraft’s inequality, MDL, and scoring rules to a generalized theory of information for expectation measures, with implications for nonstochastic data modeling and point-process applications.
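The Poisson interpretation and the chain rule can be checked numerically on a finite alphabet. The sketch below is illustrative only and rests on an assumption not stated in this summary: that the divergence between finite measures takes the standard generalized Kullback-Leibler form $D(\mu\|\nu)=\sum_a\left(\mu(a)\ln\frac{\mu(a)}{\nu(a)}-\mu(a)+\nu(a)\right)$. The function names and the example measures are hypothetical.

```python
import math

def divergence(mu, nu):
    """D(mu||nu) for finite measures on a shared finite alphabet.

    Assumed form (generalized KL): sum of mu*ln(mu/nu) - mu + nu.
    Requires nu[a] > 0 wherever mu[a] > 0.
    """
    total = 0.0
    for a in set(mu) | set(nu):
        m = mu.get(a, 0.0)
        n = nu.get(a, 0.0)
        if m > 0:
            total += m * math.log(m / n)
        total += n - m
    return total

def poisson_kl(lam1, lam2):
    """KL divergence between Poisson distributions with means lam1, lam2 > 0."""
    return lam1 * math.log(lam1 / lam2) - lam1 + lam2

def chain_rule_parts(mu, nu):
    """Split D(mu||nu) into a sample-size term and a letter term."""
    norm_mu = sum(mu.values())
    norm_nu = sum(nu.values())
    # Uncertainty about the total count: Poisson with mean ||mu|| vs ||nu||.
    size_term = poisson_kl(norm_mu, norm_nu)
    # Uncertainty about letters: KL between the normalized measures,
    # weighted by the expected sample size ||mu||.
    letter_term = norm_mu * sum(
        (m / norm_mu) * math.log((m / norm_mu) / (nu[a] / norm_nu))
        for a, m in mu.items()
        if m > 0
    )
    return size_term, letter_term

# Hypothetical expectation measures on the alphabet {a, b}.
mu = {"a": 3.0, "b": 1.0}
nu = {"a": 2.0, "b": 4.0}

total = divergence(mu, nu)
per_letter = sum(poisson_kl(mu[a], nu[a]) for a in mu)
size_term, letter_term = chain_rule_parts(mu, nu)
```

Under the stated assumption, `total` agrees with `per_letter` (a Poisson process on a finite alphabet is a product of independent Poisson counts, one per letter) and with `size_term + letter_term` up to floating-point rounding.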
Abstract
Shannon based his information theory on the notion of probability measures as it was developed by Kolmogorov. In this paper we study some fundamental problems in information theory based on expectation measures. In the theory of expectation measures it is natural to study data sets where no randomness is present, and it is also natural to study information theory for point processes as well as sampling where the sample size is not fixed. Expectation measures in combination with Kraft's Inequality can be used to clarify in which cases probability measures can be used to quantify randomness.
