Predicting integers from continuous parameters
Bas Maat, Peter Bloem
TL;DR
The paper tackles predicting integer-valued labels from neural network features by modeling outputs with truly discrete distributions whose parameters are continuous. It considers six distribution families: three established discretizations ($dnormal$, $dlaplace$, $dweibull$) and three novel discrete analogues (Dalap, Danorm, Bitwise). These are evaluated across tabular, sequential, and image-generation tasks. Dalap and Bitwise deliver the strongest negative log-likelihood, while Danorm often yields the best RMSE among the discrete options; continuous relaxations also do well on RMSE but do not assign true probabilities to discrete outcomes. The results indicate that discrete probabilistic modeling can be competitive with standard continuous relaxations on tasks requiring integer outputs, with practical advantages for sequence and image modeling, where discrete outputs are essential.
Abstract
We study the problem of predicting numeric labels that are constrained to the integers or to a subrange of the integers. Examples include the number of up-votes on a social media post or the number of bicycles available at a public rental station. While it is possible to model these as continuous values and to apply traditional regression, this approach changes the underlying distribution on the labels from discrete to continuous. Discrete distributions have certain benefits, which leads us to the question of whether such integer labels can be modeled directly by a discrete distribution whose parameters are predicted from the features of a given instance. Moreover, we focus on the use case of output distributions of neural networks, which adds the requirement that the parameters of the distribution be continuous, so that backpropagation and gradient descent may be used to learn the weights of the network. We investigate several options for such distributions, some existing and some novel, and test them on a range of tasks, including tabular learning, sequential prediction and image generation. We find that overall the best performance comes from two distributions: Bitwise, which represents the target integer in bits and places a Bernoulli distribution on each, and a discrete analogue of the Laplace distribution, which uses a distribution with exponentially decaying tails around a continuous mean.
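To make the Bitwise idea concrete, the following is a minimal sketch of the log-probability it assigns to an integer, assuming only what the abstract states: the integer is written in binary and each bit gets an independent Bernoulli distribution. The function names, the least-significant-bit-first ordering, and the use of logits as the continuous parameters are illustrative assumptions, not the authors' implementation.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bitwise_log_prob(n, logits):
    """Log-probability of integer n under independent Bernoulli bits.

    Assumption: logits[i] is the continuous parameter for bit i,
    least significant bit first, so len(logits) bits cover the
    integers 0 .. 2**len(logits) - 1.
    """
    k = len(logits)
    if not 0 <= n < 2 ** k:
        raise ValueError("n is not representable with the given number of bits")
    total = 0.0
    for i, z in enumerate(logits):
        bit = (n >> i) & 1
        # P(bit = 1) = sigmoid(z), P(bit = 0) = sigmoid(-z)
        total += log_sigmoid(z) if bit else log_sigmoid(-z)
    return total


# Hypothetical parameters for a 3-bit output (integers 0..7).
example_logits = [0.3, -1.2, 2.0]
total_mass = sum(math.exp(bitwise_log_prob(n, example_logits)) for n in range(8))
```

Because the bits are independent, the probabilities over all 2^k integers sum to one, so the negative of `bitwise_log_prob` could serve directly as a training loss; the logits are continuous, which is what allows gradient-based learning.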
