Point Prediction for Streaming Data
Aleena Chanda, N. V. Vinodchandran, Bertrand Clarke
TL;DR
The paper tackles point prediction for streaming data under $\mathcal{M}$-open conditions, where no true data-generating model is assumed. It introduces two novel predictors: a hash-function based Count-Min sketch predictor (HBP) that estimates an empirical distribution function in one pass, and a Gaussian process predictor with a random additive bias to avoid misleading convergence in nonparametric settings, along with Shtarkov and Dirichlet process baselines for comparison. Through theoretical results on consistency, error bounds, and streaming convergence, and extensive empirical comparisons across rainfall and sensor datasets, the one-pass CMS-based median predictor frequently delivers the strongest cumulative $L^1$ performance on complex data, with GP-based methods offering competitive alternatives in other regimes. The work demonstrates scalable, model-agnostic prediction tools for streaming data where traditional stochastic modeling may be inappropriate, highlighting a practical matching between data complexity and predictor sophistication.
Abstract
We present two new approaches for point prediction with streaming data. One is based on the Count-Min sketch (CMS) and the other on Gaussian process priors with a random bias. These methods are intended for the most general predictive problems, where no true model can be usefully formulated for the data stream. In statistical contexts, this is often called the $\mathcal{M}$-open problem class. Under the assumption that the data consist of i.i.d. samples from a fixed distribution function $F$, we show that the CMS-based estimates of the distribution function are consistent. We compare our new methods with two established predictors in terms of cumulative $L^1$ error. One is based on the Shtarkov solution (often called the normalized maximum likelihood) in the normal experts setting and the other is based on Dirichlet process priors. These comparisons are made in two cases. The first is one-pass, meaning that the predictors are updated using the fact that the CMS is a sketch. The second covers predictors that are not one-pass, for which we use streaming $K$-means to give a representative subset of fixed size that can be updated as data accumulate. Preliminary computational work suggests that the one-pass median version of the CMS method is rarely outperformed by the other methods for sufficiently complex data. We also find that predictors based on Gaussian process priors with random biases perform well. The Shtarkov predictors we use here did not perform as well, probably because we used only the simplest example. The other predictors seemed to perform well mainly when the data did not look like they came from an $\mathcal{M}$-open data generator.
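To make the one-pass idea concrete, the following is a minimal sketch of how a Count-Min sketch could support a median point predictor scored by cumulative $L^1$ error. It is an illustration under stated assumptions, not the construction used in the paper: the fixed binning of the data range (`to_bin`, `lo`, `hi`, `n_bins`), the universal hash family, the normalization by the summed estimated counts, and all function names are hypothetical choices made here for the example.

```python
import random


class CountMinSketch:
    """Count-Min sketch over non-negative integer keys (here: bin indices)."""

    PRIME = 2_147_483_647  # Mersenne prime 2**31 - 1, used for universal hashing

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]
        # One (a, b) pair per row defines h(x) = ((a*x + b) mod p) mod width.
        self.params = [
            (rng.randrange(1, self.PRIME), rng.randrange(0, self.PRIME))
            for _ in range(depth)
        ]
        self.n = 0  # number of items seen so far

    def _cols(self, key):
        return [((a * key + b) % self.PRIME) % self.width for a, b in self.params]

    def update(self, key):
        """Process one stream item: increment one counter in every row."""
        for row, col in zip(self.table, self._cols(key)):
            row[col] += 1
        self.n += 1

    def estimate(self, key):
        """Estimated count of key; never below the true count."""
        return min(row[col] for row, col in zip(self.table, self._cols(key)))


def to_bin(y, lo, hi, n_bins):
    """Assumed discretization: map y to one of n_bins equal bins on [lo, hi]."""
    k = int((y - lo) / (hi - lo) * n_bins)
    return min(max(k, 0), n_bins - 1)


def median_prediction(cms, lo, hi, n_bins):
    """Midpoint of the first bin where the estimated CDF reaches 1/2."""
    counts = [cms.estimate(k) for k in range(n_bins)]
    total = sum(counts)  # may exceed cms.n because collisions inflate counts
    running = 0
    for k, c in enumerate(counts):
        running += c
        if 2 * running >= total:
            return lo + (k + 0.5) * (hi - lo) / n_bins


def cumulative_l1(stream, lo, hi, n_bins=50, width=64, depth=4, burn_in=1):
    """Predict each point from the sketch of past data, then update the sketch."""
    cms = CountMinSketch(width, depth)
    err = 0.0
    for i, y in enumerate(stream):
        if i >= burn_in:
            err += abs(y - median_prediction(cms, lo, hi, n_bins))
        cms.update(to_bin(y, lo, hi, n_bins))
    return err
```

Each data point is touched once and the memory footprint is fixed at `width * depth` counters, which is the property that makes a sketch-based predictor one-pass. Because hash collisions can only inflate counts, the sketch here normalizes by the summed estimates rather than by the number of observations; other normalizations are equally plausible.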
