Point Prediction for Streaming Data

Aleena Chanda; N. V. Vinodchandran; Bertrand Clarke

TL;DR

The paper tackles point prediction for streaming data under $\mathcal{M}$-open conditions, where no true data-generating model is assumed. It introduces two novel predictors: a hash-function-based Count-Min sketch predictor (HBP) that estimates an empirical distribution function in one pass, and a Gaussian process predictor with a random additive bias to avoid misleading convergence in nonparametric settings. Shtarkov and Dirichlet process predictors serve as baselines for comparison. The paper gives theoretical results on consistency, error bounds, and streaming convergence, together with empirical comparisons across rainfall and sensor datasets. In these comparisons, the one-pass CMS-based median predictor frequently delivers the strongest cumulative $L^1$ performance on complex data, with GP-based methods offering competitive alternatives in other regimes. The work demonstrates scalable, model-agnostic prediction tools for streaming data where traditional stochastic modeling may be inappropriate, highlighting a practical matching between data complexity and predictor sophistication.

Abstract

We present two new approaches for point prediction with streaming data. One is based on the Count-Min sketch (CMS) and the other is based on Gaussian process priors with a random bias. These methods are intended for the most general predictive problems where no true model can be usefully formulated for the data stream. In statistical contexts, this is often called the $\mathcal{M}$-open problem class. Under the assumption that the data consist of i.i.d. samples from a fixed distribution function $F$, we show that the CMS-based estimates of the distribution function are consistent. We compare our new methods with two established predictors in terms of cumulative $L^1$ error. One is based on the Shtarkov solution (often called the normalized maximum likelihood) in the normal experts setting and the other is based on Dirichlet process priors. These comparisons are for two cases. The first is one-pass, meaning that the updating of the predictors is done using the fact that the CMS is a sketch. For predictors that are not one-pass, we use streaming $K$-means to give a representative subset of fixed size that can be updated as data accumulate. Preliminary computational work suggests that the one-pass median version of the CMS method is rarely outperformed by the other methods for sufficiently complex data. We also find that predictors based on Gaussian process priors with random biases perform well. The Shtarkov predictors we use here did not perform as well, probably because we were only using the simplest example. The other predictors seemed to perform well mainly when the data did not look like they came from an $\mathcal{M}$-open data generator.
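To make the idea of a one-pass, sketch-based median predictor concrete, the following is a minimal sketch and not the authors' HBP method. It assumes that observations lie in a known range `[lo, hi]` and are discretized into `n_bins` equal-width bins. The class `CountMinSketch`, the functions `cms_median_predictions` and `cumulative_l1`, and all parameter names are hypothetical and chosen only for illustration. Each point is predicted by the median of the sketched histogram of the earlier points, and the predictor is scored by cumulative $L^1$ error.

```python
import random


class CountMinSketch:
    """Count-Min sketch over integer keys with ((a*x + b) mod p) mod width hashes."""

    PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]
        # One (a, b) pair per row defines that row's hash function.
        self.params = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(depth)
        ]

    def _cols(self, key):
        return [((a * key + b) % self.PRIME) % self.width for a, b in self.params]

    def add(self, key, count=1):
        for row, col in enumerate(self._cols(key)):
            self.table[row][col] += count

    def estimate(self, key):
        # The minimum over rows never underestimates the true count.
        return min(self.table[row][col] for row, col in enumerate(self._cols(key)))


def cms_median_predictions(stream, lo, hi, n_bins=50, width=64, depth=4):
    """Predict each point by the median of the sketched histogram of earlier points."""
    sketch = CountMinSketch(width, depth)
    bin_width = (hi - lo) / n_bins
    preds = []
    for i, y in enumerate(stream):
        if i == 0:
            preds.append((lo + hi) / 2)  # assumption: midpoint before any data
        else:
            counts = [sketch.estimate(k) for k in range(n_bins)]
            total = sum(counts)
            running = 0
            for k, c in enumerate(counts):
                running += c
                if 2 * running >= total:
                    preds.append(lo + (k + 0.5) * bin_width)  # bin midpoint
                    break
        # Update the sketch only after predicting, so each point is seen once.
        k = min(n_bins - 1, max(0, int((y - lo) / bin_width)))
        sketch.add(k)
    return preds


def cumulative_l1(stream, preds):
    """Cumulative L1 error: the sum of absolute prediction errors."""
    return sum(abs(y - p) for y, p in zip(stream, preds))


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(5.0, 1.0) for _ in range(500)]
    preds = cms_median_predictions(data, lo=0.0, hi=10.0)
    print(cumulative_l1(data, preds))
```

The sketch stores only a `depth` by `width` table of counters, so memory does not grow with the length of the stream; the price is that bin counts can be overestimated when keys collide.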

Paper Structure (24 sections, 10 theorems, 131 equations, 2 figures, 4 tables)

Problem Formulation
Hash Function Based Predictors
The HBP Method
A Few Key Properties
Bounds on error and storage
Convergence of the EEDF in probability
A streaming Glivenko-Cantelli theorem.
Bayesian Predictors
No Bias
Random Additive Bias
Dirichlet Process prior prediction
Shtarkov Solution Based Predictors
The Shtarkov solution
The Shtarkov Predictors
Special Cases
...and 9 more sections

Key Result

Theorem 1

For all $\epsilon > 0$ and all $\delta > 0$, there exists $N$ such that, for all $d_K > N$, $P(\forall j = 1,\ldots,d_K : \hat{a}_{jk}(n) \leq a_{k}(n)+\epsilon\|a\|_1 ) \leq \delta$.
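For orientation, the classical Count-Min sketch guarantee (stated here as background, not as the paper's Theorem 1) says that with width $\lceil e/\epsilon \rceil$ and depth $\lceil \ln(1/\delta) \rceil$, the minimum over rows of the per-row counts never falls below the true count $a_k$ and exceeds $a_k + \epsilon\|a\|_1$ with probability at most $\delta$. The following is a small illustrative check of that classical bound. The function names `make_hashes`, `sketch_counts`, and `fraction_within_bound` are hypothetical, and reading the per-row counts as the $\hat{a}_{jk}$ of the theorem is an assumption.

```python
import math
import random
from collections import Counter

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1


def make_hashes(depth, width, rng):
    """Return `depth` hash functions of the form ((a*x + b) mod p) mod width."""
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(depth)]
    return [lambda x, a=a, b=b: ((a * x + b) % PRIME) % width for a, b in params]


def sketch_counts(keys, eps, delta, seed=0):
    """Build a Count-Min sketch of `keys` and return its point-query function."""
    width = math.ceil(math.e / eps)
    depth = max(1, math.ceil(math.log(1 / delta)))
    hashes = make_hashes(depth, width, random.Random(seed))
    table = [[0] * width for _ in range(depth)]
    for key in keys:
        for row, h in enumerate(hashes):
            table[row][h(key)] += 1

    def estimate(key):
        # Each row gives one (over)estimate; the sketch reports the smallest.
        return min(table[row][h(key)] for row, h in enumerate(hashes))

    return estimate


def fraction_within_bound(keys, eps, delta, seed=0):
    """Fraction of distinct keys whose estimate is at most a_k + eps * ||a||_1."""
    true = Counter(keys)
    estimate = sketch_counts(keys, eps, delta, seed)
    l1_norm = len(keys)  # ||a||_1 is the total number of stream items
    ok = sum(estimate(k) <= c + eps * l1_norm for k, c in true.items())
    return ok / len(true)


if __name__ == "__main__":
    rng = random.Random(2)
    keys = [rng.randrange(1000) for _ in range(20_000)]
    print(fraction_within_bound(keys, eps=0.01, delta=0.01))
```

The bound is a per-query probability statement over the random choice of hash functions, so the printed fraction is only an informal illustration on one draw.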

Figures (2)

Figure 1: Left: Plot of the Columbia data as a time series. Right: Plot of the Bhubhneshwar data as a time series.
Figure 2: Plot of first two quarters of the accelerometer data.

Theorems & Definitions (21)

Theorem 1
proof
Theorem 2
proof
Theorem 3
proof
Corollary 2.1
proof
Theorem 4
proof
...and 11 more
