Categorical Distributions are Effective Neural Network Outputs for Event Prediction

Kevin Doran, Tom Baden

TL;DR

The paper proposes using a fixed, N-bin categorical output to represent a piecewise-constant density for next-event prediction, which allows both discrete-time and continuous-time event sequences to be modeled. The output logits are mapped onto quantile-based intervals, so the method can handle mixed inter-event-time distributions, and it is particularly strong on datasets with discrete structure, such as the NYC taxi data. Through analyses across real-world, synthetic, and novel spike-prediction tasks, the authors show that dataset size and task structure, rather than model size alone, drive performance, with larger models benefiting mainly when abundant data are available. They also introduce synthetic modulo datasets and a modified Metropolis-Hastings sampler to probe scaling behavior, and they provide evidence that more data can unlock the benefits of categorical outputs, offering practical guidance for selecting output representations in temporal point processes. The work highlights the importance of aligning the output structure with the underlying data and task, and it provides benchmarks for evaluating next-event prediction models on large-scale data.

Abstract

We demonstrate the effectiveness of the categorical distribution as a neural network output for next event prediction. This is done for both discrete-time and continuous-time event sequences. To model continuous-time processes, the categorical distribution is interpreted as a piecewise-constant density function and is shown to be competitive across a range of datasets. We then argue for the importance of studying discrete-time processes by introducing a neuronal spike prediction task motivated by retinal prosthetics, where discretization of event times is consequent on the task description. Separately, we show evidence that commonly used datasets favour smaller models. Finally, we introduce new synthetic datasets for testing larger models, as well as synthetic datasets with discrete event times.
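
The piecewise-constant interpretation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `quantile_bin_edges` and `log_density`, the choice to start the support at 0, and the handling of times beyond the last edge are all assumptions made for illustration. The idea is that bin k, with probability p_k and width w_k, contributes a constant density p_k / w_k, so the log-likelihood of an inter-event time is log p_k - log w_k.

```python
import numpy as np


def quantile_bin_edges(train_intervals, n_bins=64):
    """Bin edges such that each bin holds roughly 1/n_bins of the training data.

    Hypothetical helper: assumes continuous inter-event times, so that the
    quantiles are strictly increasing and every bin has positive width.
    """
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    edges = np.quantile(train_intervals, qs)
    edges[0] = 0.0  # assumption: the density's support starts at zero
    if np.any(np.diff(edges) <= 0):
        raise ValueError("tied quantiles produce zero-width bins")
    return edges


def log_density(logits, edges, t):
    """Log of the piecewise-constant density at inter-event time t."""
    log_probs = logits - np.logaddexp.reduce(logits)  # log-softmax
    widths = np.diff(edges)
    # Index of the bin containing t; times past the last edge are assigned
    # to the final bin (an assumption of this sketch).
    k = np.clip(np.searchsorted(edges, t, side="right") - 1, 0, len(widths) - 1)
    return log_probs[k] - np.log(widths[k])


# Usage with synthetic exponential inter-event times and untrained logits.
rng = np.random.default_rng(0)
train = rng.exponential(1.0, size=10_000)
edges = quantile_bin_edges(train, n_bins=64)
logits = np.zeros(64)  # stand-in for a network output
nll = -log_density(logits, edges, 0.7)
```

With quantile-based edges, narrow bins fall where training events are dense, so the density can be sharply peaked there without increasing the number of outputs. Data with exactly repeated inter-event times would yield tied quantiles and zero-width bins, which this sketch rejects rather than handles.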

Paper Structure

This paper contains 55 sections, 1 equation, 34 figures, 9 tables.

Figures (34)

  • Figure 1: Bottom: Histogram of inter-event times for the NYC taxi dataset (Whong, 2014). Top: A categorical distribution mapped to intervals, each containing $\frac{1}{64}$ of training set events, defines a piecewise-constant probability density function.
  • Figure 1: Two models evaluated in terms of test set NLL (lower is better) on existing real-world datasets. The rnn-logmix model from Shchur et al. (2020) acts as a baseline. The rnn-cat model uses the same architecture but with a categorical output structure. Values are means over 10 trials. Results for rnn-cat are reported as differences from the results for rnn-logmix.
  • Figure 2: Dataset training set sizes. Datasets marked by ⁎ have a high Gini coefficient ($> 0.8$), which may indicate a discrete component in the distribution (see the datasets appendix). New and extended datasets are introduced in the sections labelled sec:bigger_not_better and sec:modulo_datasets.
  • Figure 3: Performance comparison of 14 models in terms of test set NLL across 8 datasets and 16 training set lengths from $2^{10}$ to $2^{25}$. The colormap ranges span each sub-figure's full range of values, except for the NYC taxi dataset, where the categorical models' very low NLL scores (all $< 3.4$) are separated to preserve the colormap detail for other models. For synthetic datasets where a theoretical best score is known, it is marked with a red dashed line. Each value is calculated from a single run.
  • Figure 4: Histogram of 200k samples from the modified Metropolis-Hastings algorithm.
  • ...and 29 more figures