Pathwise guessing in categorical time series with unbounded alphabets
J.-R. Chazottes, S. Gallo, D. Takahashi
TL;DR
This paper develops a non-parametric framework for probabilistic guessing in categorical time series with potentially unbounded alphabets and long memory. It introduces a data-driven estimator that maximizes conditional empirical frequencies and provides risk bounds that are independent of the alphabet size, under a general dependence condition captured by $\Gamma(p)$ and a margin parameter $\delta_{D,G}$. The authors prove an upper bound and a minimax lower bound that match up to a logarithmic factor, with explicit rates that depend on the margin regime, and show that the framework applies to a broad set of models including Markov chains, autoregressive models, Poisson regression, hidden Markov chains, mixtures, and Gibbs measures. The results rely on a DKW-type inequality for dependent sequences and establish exponential convergence in favorable margin regimes, which is most relevant when the alphabet is large or unbounded. Overall, the work provides a principled, non-parametric approach to guessing in complex time-series settings where estimating the conditional probabilities themselves would be impractical.
Abstract
The following learning problem arises naturally in various applications: Given a finite sample from a categorical or count time series, can we learn a function of the sample that (nearly) maximizes the probability of correctly guessing the values of a given portion of the data using the values from the remaining parts? Unlike classical approaches in statistical inference, our approach avoids explicitly estimating the conditional probabilities. We propose a non-parametric guessing function with a learning rate independent of the alphabet size. Our analysis focuses on a broad class of time series models that encompasses finite-order Markov chains, some hidden Markov chains, Poisson regression for count processes, and one-dimensional Gibbs measures. We provide a margin condition that controls the rate of convergence for the risk. Additionally, we establish a minimax lower bound for the convergence rate of the risk associated with our guessing problem. This lower bound matches the upper bound achieved by our estimator up to a logarithmic factor, demonstrating its near-optimality.
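As a rough illustration of guessing by maximizing conditional empirical frequencies, the following Python sketch handles the simplest case: guessing the next symbol from the preceding `context_len` symbols. It is not the authors' estimator. The function names `fit_guesser` and `guess`, the fixed context length, and the toy sample are all assumptions made for this example, and the paper's setting (guessing an arbitrary portion of the data from the remaining parts) is more general.

```python
from collections import Counter, defaultdict


def fit_guesser(sample, context_len):
    """Count which symbols follow each observed context of length context_len.

    Hypothetical helper: only contexts that actually occur in the sample
    are stored, so memory does not depend on the alphabet size.
    """
    counts = defaultdict(Counter)
    for i in range(context_len, len(sample)):
        context = tuple(sample[i - context_len:i])
        counts[context][sample[i]] += 1
    return counts


def guess(counts, context, default=None):
    """Return the symbol seen most often after `context`.

    Falls back to `default` when the context never occurred in the sample.
    """
    followers = counts.get(tuple(context))
    if not followers:
        return default
    return followers.most_common(1)[0][0]


# Toy count-valued sample (illustrative data, not from the paper).
sample = [0, 1, 0, 1, 0, 2, 0, 1, 0, 1]
counts = fit_guesser(sample, context_len=1)

print(guess(counts, [0]))  # most frequent follower of 0 in the sample
print(guess(counts, [7]))  # unseen context, so the default is returned
```

The sketch never forms a normalized estimate of the conditional distribution; it only compares counts among symbols that actually followed the given context, which is the sense in which the guess avoids estimating conditional probabilities explicitly.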
