From Letters to Words and Back: Invertible Coding of Stationary Measures
Łukasz Dębowski
TL;DR
This work constructs the normalized transport, an invertible mapping between probability measures on word-like and letter-like infinite sequences that is built from self-avoiding codes and preserves stationarity and ergodicity. The author gives global and trajectory-based formulations that recover and invert each other, connects the construction to asymptotically mean stationary measures, and shows that it preserves ergodic decompositions. The paper proves that successive recurrence times are ergodic under an ergodic measure and derives a precise relation between entropy rates across the transport, tying the entropy rate to the average code length via $h_Y = h_X / \int L_g \, dP_X$. The framework offers a principled bridge between textual representations and provides tools for analyzing information rates and recurrence phenomena in stationary processes, with potential applications in statistical language modeling and formal linguistics.
Abstract
Motivated by problems of statistical language modeling, we consider probability measures on infinite sequences over two countable alphabets of different cardinalities, such as letters and words. We introduce an invertible mapping between such measures, called the normalized transport, that preserves both stationarity and ergodicity. The normalized transport uses so-called self-avoiding codes, which generalize comma-separated codes and specialize bijective stationary codes. The normalized transport is also connected to the usual measure transport via underlying asymptotically mean stationary measures. It preserves the ergodic decomposition. The normalized transport and self-avoiding codes arise in the problem of successive recurrence times. We show that successive recurrence times are ergodic for an ergodic measure, which strengthens a result by Chen Moy from 1959. We also relate the entropy rates of processes linked by the normalized transport.
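
As a rough illustration of the entropy rate relation $h_Y = h_X / \int L_g \, dP_X$ quoted in the TL;DR, the following minimal sketch uses a comma-separated code, the simplest kind of code that self-avoiding codes generalize. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy word distribution `WORD_PROBS`, the separator `SEP`, the IID word process (for which the entropy rate equals the entropy of one word), and the reading of $L_g$ as the code-word length including the separator.

```python
import math

# Hypothetical toy distribution of IID words (an assumption, not from the paper).
WORD_PROBS = {"a": 0.5, "bb": 0.25, "ccc": 0.25}
SEP = "_"  # separator letter, assumed never to occur inside a word


def encode(words):
    """Comma-separated code: append the separator to every word."""
    return "".join(w + SEP for w in words)


def decode(letters):
    """Invert the code by cutting the letter string at each separator."""
    return letters.split(SEP)[:-1]


def entropy_bits(probs):
    """Shannon entropy in bits of a finite distribution."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


# Entropy rate per word; for IID words this is the entropy of a single word.
h_x = entropy_bits(WORD_PROBS)

# Expected code-word length in letters, counting the separator.
mean_len = sum(p * (len(w) + 1) for w, p in WORD_PROBS.items())

# Entropy rate per letter according to the stated relation.
h_y = h_x / mean_len

print(f"h_X = {h_x} bits/word, E[L_g] = {mean_len} letters/word")
print(f"h_Y = {h_y:.4f} bits/letter")
```

Because the separator marks every word boundary, `decode(encode(words))` returns the original word list, which is the finite-string analogue of the invertibility that the normalized transport provides for measures on infinite sequences.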
