Prediction from compression for models with infinite memory, with applications to hidden Markov and renewal processes
Yanjun Han, Tianze Jiang, Yihong Wu
TL;DR
This work develops a universal-compression framework for predicting the next symbol in sequences generated by processes with long memory, notably hidden Markov models (HMMs) and renewal processes. By decomposing the minimax prediction risk into a redundancy term and a memory term, the authors derive upper bounds and matching lower bounds, showing that for HMMs with $k$ hidden states and $\ell$ observation symbols the optimal KL prediction risk scales as $\Theta\big(\frac{k\ell}{n}\log\frac{n}{k\ell} + \frac{k^2}{n}\log\frac{n}{k^2}\big)$. They provide a polynomial-time estimator that achieves the optimal rate when $k$ and $\ell$ are constant, and extend the analysis to Gaussian emissions via a general corollary. For renewal processes the optimal rate is $\Theta(n^{-1/2})$, although the optimal predictors are not computationally efficient. The results connect prediction with universal compression, yield dynamic-programming-based algorithms for HMM prediction, and illuminate the trade-offs between memory, redundancy, and computation in sequential prediction.
Abstract
Consider the problem of predicting the next symbol given a sample path of length n, whose joint distribution belongs to a distribution class that may have long-term memory. The goal is to compete with the conditional predictor that knows the true model. For both hidden Markov models (HMMs) and renewal processes, we determine the optimal prediction risk in Kullback-Leibler divergence up to universal constant factors. Extending existing results for finite-order Markov models [HJW23] and drawing ideas from universal compression, the proposed estimator has a prediction risk bounded by the redundancy of the distribution class plus a memory term that accounts for the long-range dependency of the model. Notably, for HMMs with bounded state and observation spaces, a polynomial-time estimator based on dynamic programming is shown to achieve the optimal prediction risk Θ(log n/n); prior to this work, the only known result of this type was O(1/log n), obtained using Markov approximation [Sha+18]. Matching minimax lower bounds are obtained by making connections to redundancy and mutual information via a reduction argument.
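
To make the benchmark concrete, the following minimal sketch computes the conditional predictor that knows the true model for a finite HMM, using the standard forward recursion, together with the KL divergence used to score a competing predictor. It is not the estimator proposed in the paper, which must work without knowing the parameters. The function names, the argument names `pi`, `A`, `B`, and the numerical values are illustrative assumptions, and the observed path is assumed to have positive probability under the model.

```python
import numpy as np

def oracle_next_symbol(pi, A, B, obs):
    """Return P(x_{n+1} = . | x_1..x_n) for a KNOWN finite HMM.

    pi  : initial hidden-state distribution, shape (k,)
    A   : transition matrix, A[i, j] = P(z_{t+1}=j | z_t=i), shape (k, k)
    B   : emission matrix,   B[i, x] = P(x_t=x | z_t=i),     shape (k, l)
    obs : observed symbols x_1..x_n as integers in {0, ..., l-1}
    """
    pi, A, B = (np.asarray(m, dtype=float) for m in (pi, A, B))
    belief = pi.copy()                # P(z_1)
    for x in obs:
        belief = belief * B[:, x]     # weight each state by the likelihood of x_t
        belief = belief / belief.sum()  # P(z_t | x_1..x_t)
        belief = belief @ A           # P(z_{t+1} | x_1..x_t)
    return belief @ B                 # P(x_{n+1} | x_1..x_n)

def kl(p, q):
    """KL divergence D(p || q) in nats for finite distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p = 0 contribute zero
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

if __name__ == "__main__":
    # Illustrative parameters (assumed, not taken from the paper).
    pi = [0.5, 0.5]
    A = [[0.9, 0.1], [0.2, 0.8]]
    B = [[0.8, 0.2], [0.3, 0.7]]
    p_true = oracle_next_symbol(pi, A, B, obs=[0, 0, 1])
    q_uniform = np.full(2, 0.5)       # a naive competing predictor
    print(p_true, kl(p_true, q_uniform))
```

Averaging `kl(p_true, q)` over sample paths drawn from the model gives the prediction risk of a predictor `q` relative to this oracle; the recursion costs O(n k^2) time for k hidden states.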
