Tracking Changing Probabilities via Dynamic Learners

Omid Madani

TL;DR

This paper tackles online probabilistic multiclass prediction over unbounded streams with strict memory limits, addressing external and internal nonstationarity by separating salient predictions from noise. It introduces Sparse Moving Averages (SMAs), notably the sparse EMA and a queue-based Qs predictor, and proposes DYAL, a hybrid that dynamically blends per-predictand EMA learning with queue-based switching to adapt rapidly to changes. The evaluation framework uses bounded log-loss for NS items and proper scoring principles to compare open-ended predictors under nonstationarity, demonstrating that per-predictand learning rates and a DYAL combination yield faster adaptation and lower variance than single-rate EMA or queue methods in many regimes. The findings have practical impact for lifelong, continual learning systems and real-world data streams where concepts emerge and evolve, enabling robust probability estimates for salient items while maintaining bounded memory usage.

Abstract

Consider a predictor, a learner, whose input is a stream of discrete items. The predictor's task, at every time point, is probabilistic multiclass prediction, i.e. to predict which item may occur next by outputting zero or more candidate items, each with a probability, after which the actual item is revealed and the predictor updates. To output probabilities, the predictor keeps track of the proportions of the items it has seen. The stream is unbounded (lifelong), and the predictor has finite, limited space. The task is open-ended: the set of items is unknown to the predictor and their totality can also grow unbounded. Moreover, there is non-stationarity: the underlying frequencies of items may change, substantially, from time to time. For instance, new items may start appearing and a few recently frequent items may cease to occur. The predictor, being space-bounded, need only provide probabilities for those items which, at the time of prediction, have sufficiently high frequency, i.e., the salient items. This problem is motivated in the setting of Prediction Games, a self-supervised learning regime where concepts serve as both the predictors and the predictands, and the set of concepts grows over time, resulting in non-stationarities as new concepts are generated and used. We design and study a number of predictors, sparse moving averages (SMAs), for the task. One SMA adapts the sparse exponentiated moving average and another is based on queuing a few counts, keeping dynamic per-item histories. Evaluating the predicted probabilities, under noise and non-stationarity, presents challenges, and we discuss and develop evaluation methods, one based on bounding log-loss. We show that a combination of ideas, supporting dynamic predictand-specific learning rates, offers advantages in terms of faster adaptation to change (plasticity), while also supporting low variance (stability).

Paper Structure (89 sections, 21 theorems, 29 equations, 30 figures, 18 tables)

Introduction
Preliminaries: Problem Setting, Notation, and Evaluation
Probabilities (PR s), Distributions (DI s), and Semi-distributions (SD s)
Generating Sequences: an Idealized Stream
Salient and Noise (NS) Items, and Generating with Noise
Binary Sequences, and the Stationary Binary Setting
Examples and Possibilities
Prediction Techniques: Sparse Moving Averages (SMAs)
Evaluating Probabilistic Predictors
Deviation Rates: when True PRs are Known
Unknown True PRs: Proper Scoring
Scoring Semi-Distributions (SD s)
On the Sensitivity of log-loss
Developing log-loss for NS (Noise) Items
Notes on Pseudocode
...and 74 more sections

Key Result

Lemma 1

Given DI $\mathcal{P}$ and SD $\mathcal{W}$, defined over the same finite set $\mathcal{I}$,

Figures (30)

Figure 1: (a) An example sequence of items (the capital letters) together with several prediction outputs of a hypothetical predictor, in rectangular boxes, shown for a few time points (not all outputs are shown to avoid clutter). The sequence is observed from left to right, thus at time $t=1$, item $B$ is observed ($o^{(1)}=B$), and respectively at times $2$, $3$, and $4$, items $A$, $J$, and again $A$ are observed ($o^{(3)}=J$, $o^{(4)}=A$, etc). A prediction output is a map of item to probability (PR), and can be empty. Mathematically, it is a semi-distribution SD (Sect. \ref{sec:sds}). At each time point, before the observation, the predictor predicts, i.e. provides a SD (zero or more items, each with a PR). In this example, at times $t \le 4$, nothing is predicted (empty maps, or $\mathcal{W}^{(1)}=\mathcal{W}^{(4)}=\{\}$, and only two empty outputs, at $t=1$ and $t=2$, are shown). At $t=8$, $\mathcal{W}^{(8)}$ is predicted, where $\mathcal{W}^{(8)} = \{A:0.55, B:0.2\}$ (i.e. $A$ is predicted with PR 0.55, and $B$ with PR 0.2). (b) The input sequence can be imagined as being generated by a SD $\mathcal{P}$: at each time point, for the next entry of the sequence, an item is drawn, iid (Sect. \ref{sec:ideal}). However, the SD $\mathcal{P}$ changes from time to time, such as certain item(s) being removed and new item(s) inserted in $\mathcal{P}$. In the above example, in changing from the left (initial) $\mathcal{P}^{(1)}$ to the right distribution, $\mathcal{P}^{(2)}$ (at $t=700$), $A$ is dropped (becomes 0 PR), while $H$ and $C$ are inserted, and $W$ increases in PR while $B$ is unchanged.
Figure 2: (a) Online processing of a stream or sequence means repeating the prequential predict-observe-update cycle. (b) An SMA converts a stream of item observations, $[o]$, to a stream of predictions, $[\mathcal{W}]$, or $[o]_{1}^{N} \rightarrow [\mathcal{W}]_{1}^{N}$.
Figure 3: Functions used in evaluating the probabilities. (a) CapAndFilter() is applied to the output of any predictor, at every time $t$, before evaluation, performing filtering (dropping small PRs below $p_{min}$) and, if necessary, explicit capping, i.e. normalizing or scaling down, meaning that the final output will be a SD $\mathcal{W}'$, where $\hbox{a}(\mathcal{W}')\le 1-p_{NS}$ (or $\hbox{u}(\mathcal{W}')\ge p_{NS}$). $p_{NS}=p_{min}=0.01$ in experiments. (b) Scoring via log-loss, handling NS items (bounded log-loss).
Figure 4: (a) A simple NS-marker, a referee to mark an item NS or not, via a count map. In particular we used the Box technique of Sect. \ref{sec:box} (often with no limit on history size). (b) Venn diagrams, with background noise, are useful in picturing how sequences are generated, e.g. in synthetic experiments. Here, as we go from left to right in generating a lowest achievable loss plot of Fig. \ref{fig:lowest_loss}(c), three Venn diagrams of the underlying SDs are shown for the case of two salient items (left $\mathcal{P}^{(1)}=\{\}$ while right $\mathcal{P}^{(3)}$ could be $\{A:0.25, B:0.25\}$). (c) Optimal (lowest achievable) log-loss using the LogLossNS() function of Fig. \ref{code:norming_etc}, as the PR of $k$ salient items, all equally likely, is increased to maximum possible ($1/k$), from left to right. The lowest loss is (near) 0 when any observed item is noise (on the left) or, on the right, when there is a single salient item with PR 1.0. Maximum loss, of any predictor (not just the optimum), never exceeds $-\ln(p_{min})$ ($\approx 4.6$ in this paper, when $p_{min}=0.01$), and for the optimum here, it is reached when there are $k\approx\frac{1}{p_{min}}$ salient items, each with max PR $\approx p_{min}$.
Figure 5: Pseudocode of (a) sparse EMA with a single learning rate $\beta$, either fixed ("static" EMA), or (b) decayed with a harmonic schedule down to a minimum $\beta_{min}$ ("harmonic" EMA, Sect. \ref{sec:harm}). An EMA update can be split into two steps (Fig. \ref{fig:ema_phases}): 1) weaken, i.e. weaken all existing edge weights (entries of the map $EmaMap$), and 2) strengthen, i.e. boost (the weight of) the edge to the observed item (target). The map entry is created if it doesn't already exist (edge insertion). Initially (at $t=1$), there are no edges. EMA enjoys a number of desirable properties, such as the probabilities in $EmaMap$ forming a SD, and approximate convergence (Sect. \ref{sec:ema}).
...and 25 more figures
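
The captions of Figures 2, 3, and 5 describe the predict-observe-update cycle, the filtering and capping step, and the two-step (weaken, then strengthen) update of a static sparse EMA. The following is a minimal Python sketch of those three pieces, not the paper's pseudocode. The function names (ema_update, cap_and_filter, run), the choice of beta = 0.1, and the decision to keep PRs exactly equal to p_min are assumptions made for illustration; the harmonic schedule, any pruning of the internal map, and the NS handling of the log-loss are deliberately left out.

```python
def ema_update(ema_map, observed, beta):
    """One static sparse-EMA update on a dict of item -> PR (modified in place)."""
    # Step 1 (weaken): shrink every existing weight by a factor (1 - beta).
    for item in ema_map:
        ema_map[item] *= (1.0 - beta)
    # Step 2 (strengthen): boost the observed item, creating its entry if absent.
    ema_map[observed] = ema_map.get(observed, 0.0) + beta
    return ema_map


def cap_and_filter(sd, p_min=0.01, p_ns=0.01):
    """Drop PRs below p_min, then scale down so the total is at most 1 - p_ns.

    Assumption: PRs exactly equal to p_min are kept, and rescaling is done
    once, after filtering.
    """
    kept = {item: p for item, p in sd.items() if p >= p_min}
    total = sum(kept.values())
    cap = 1.0 - p_ns
    if total > cap:
        scale = cap / total
        kept = {item: p * scale for item, p in kept.items()}
    return kept


def run(stream, beta=0.1):
    """Prequential loop: predict (before observing), observe, update."""
    ema_map = {}   # initially there are no edges
    outputs = []   # one filtered and capped SD per time point
    for observed in stream:
        outputs.append(cap_and_filter(ema_map))
        ema_update(ema_map, observed, beta)
    return ema_map, outputs


if __name__ == "__main__":
    final_map, outputs = run(["B", "A", "J", "A"], beta=0.1)
    print(final_map)
    print(outputs)
```

Starting from an empty map, each update maps the total mass s to (1 - beta) * s + beta, which never exceeds 1, so the map remains a semi-distribution, consistent with the property stated in the Figure 5 caption.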

Theorems & Definitions (40)

Definition 1
Lemma 1
Corollary 1
Definition 2
Lemma 2
Theorem 1
Lemma 3
Lemma 4
proof
Corollary 2
...and 30 more
