Categorical Distributions are Effective Neural Network Outputs for Event Prediction
Kevin Doran, Tom Baden
TL;DR
The paper proposes using a fixed, N-bin categorical output to represent a piecewise-constant density for next-event prediction, enabling effective modeling of both discrete-time and continuous-time event sequences. By interpreting the head's logits through a quantile-based interval scheme, the method handles mixed inter-event-time distributions and is particularly strong on datasets with discrete structure, such as NYC taxi data. Through analyses across real-world, synthetic, and novel spike-prediction tasks, the authors show that dataset size and task structure, rather than model size alone, drive performance, with larger models benefiting mainly when abundant data are available. They also introduce synthetic modulo-datasets and a broken Metropolis sampler to probe scaling behavior, and they provide evidence that more data can unlock the benefits of categorical outputs, offering practical guidance for selecting output representations in temporal point processes. The work highlights the importance of aligning output structure with the underlying data and task, and it provides benchmarks for evaluating next-event prediction models on large-scale data.
Abstract
We demonstrate the effectiveness of the categorical distribution as a neural network output for next event prediction, for both discrete-time and continuous-time event sequences. To model continuous-time processes, the categorical distribution is interpreted as a piecewise-constant density function and is shown to be competitive across a range of datasets. We then argue for the importance of studying discrete-time processes by introducing a neuronal spike prediction task motivated by retinal prosthetics, where discretization of event times follows from the task description. Separately, we show evidence that commonly used datasets favour smaller models. Finally, we introduce new synthetic datasets for testing larger models, as well as synthetic datasets with discrete event times.
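To make the piecewise-constant interpretation concrete, the following is a minimal sketch rather than the authors' implementation. It assumes bin edges placed at evenly spaced quantiles of the training inter-event times, a density that is uniform within each bin, and times beyond the last edge being assigned to the last bin. The function names `quantile_bin_edges` and `piecewise_constant_log_density` are hypothetical.

```python
import numpy as np

def quantile_bin_edges(train_intervals, n_bins):
    """Bin edges at evenly spaced quantiles of the training inter-event times.

    Assumption: the data are continuous enough that the edges are strictly
    increasing; repeated values could produce zero-width bins.
    """
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    edges = np.quantile(train_intervals, qs)
    edges[0] = 0.0  # assumption: the support starts at zero
    return edges

def piecewise_constant_log_density(logits, edges, t):
    """Log density at time t: categorical mass of the bin divided by its width."""
    # Log-softmax over the N logits gives the log probability mass per bin.
    log_probs = logits - np.logaddexp.reduce(logits)
    widths = np.diff(edges)
    # Index of the bin containing t; out-of-range times are clipped
    # to the first or last bin (an assumption of this sketch).
    k = np.clip(np.searchsorted(edges, t, side="right") - 1, 0, len(widths) - 1)
    return log_probs[k] - np.log(widths[k])

# Usage: four bins with uniform logits over hand-picked edges.
edges = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
logits = np.zeros(4)
log_p = piecewise_constant_log_density(logits, edges, 3.0)
```

Because each bin's probability mass is spread uniformly over its width, the density integrates to one over the covered interval, and the negative of the value returned above can serve directly as a per-event training loss.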
