Predicting integers from continuous parameters

Bas Maat, Peter Bloem

TL;DR

The paper tackles predicting integer-valued labels from neural network features by modeling outputs with truly discrete distributions whose parameters are continuous. It considers six distribution families: three established discretizations ($dnormal$, $dlaplace$, $dweibull$) and three novel discrete analogues (Dalap, Danorm, Bitwise). These are evaluated across tabular, sequential, and image-generation tasks. Dalap and Bitwise deliver the strongest negative log-likelihood performance, while Danorm often yields superior RMSE; continuous relaxations excel on RMSE but do not assign true discrete probabilities. The results indicate that discrete probabilistic modeling can be competitive with standard continuous relaxations in tasks requiring integer outputs, with practical advantages for sequence and image modeling, where discrete outputs are essential.

Abstract

We study the problem of predicting numeric labels that are constrained to the integers or to a subrange of the integers. For example, the number of up-votes on social media posts, or the number of bicycles available at a public rental station. While it is possible to model these as continuous values, and to apply traditional regression, this approach changes the underlying distribution on the labels from discrete to continuous. Discrete distributions have certain benefits, which leads us to the question whether such integer labels can be modeled directly by a discrete distribution, whose parameters are predicted from the features of a given instance. Moreover, we focus on the use case of output distributions of neural networks, which adds the requirement that the parameters of the distribution be continuous so that backpropagation and gradient descent may be used to learn the weights of the network. We investigate several options for such distributions, some existing and some novel, and test them on a range of tasks, including tabular learning, sequential prediction and image generation. We find that overall the best performance comes from two distributions: Bitwise, which represents the target integer in bits and places a Bernoulli distribution on each, and a discrete analogue of the Laplace distribution, which uses a distribution with exponentially decaying tails around a continuous mean.
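The Bitwise idea described in the abstract can be illustrated with a minimal sketch. It assumes a fixed number of bits, a least-significant-bit-first ordering, and one logit per bit passed through a sigmoid; none of these details are specified above, and the function names are hypothetical.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))


def bitwise_log_prob(n, logits):
    """Log-probability of integer n when each bit gets its own Bernoulli.

    Assumption: logits[k] parameterizes the k-th least significant bit,
    so the model covers the range 0 .. 2**len(logits) - 1.
    """
    num_bits = len(logits)
    if not 0 <= n < 2 ** num_bits:
        raise ValueError("n is outside the representable range")
    total = 0.0
    for k, logit in enumerate(logits):
        bit = (n >> k) & 1
        # log p(bit = 1) = log sigmoid(logit); log p(bit = 0) = log sigmoid(-logit)
        total += log_sigmoid(logit if bit else -logit)
    return total
```

Because the bits are independent, the probabilities of all `2**len(logits)` integers sum to one, and the negative of `bitwise_log_prob` can serve directly as a training loss in this sketch.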

Paper Structure (37 sections, 8 theorems, 50 equations, 6 figures, 7 tables)


Key Result

Proposition 1

In the unbounded case, the expected value of $p(n \mid \mu,\gamma)$ is
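The caption of Figure 3 describes this expected value in words. Written out, and assuming $f = \mu - \lfloor\mu\rfloor$ and $c = \lceil\mu\rceil - \mu$ for non-integer $\mu$ (definitions the captions suggest but do not state), that description corresponds to the following reconstruction, which may differ in form from the paper's own equation:

$$
\mathbb{E}[n] = \frac{\gamma^{f}\left(\lfloor\mu\rfloor - \frac{\gamma}{1-\gamma}\right) + \gamma^{c}\left(\lceil\mu\rceil + \frac{\gamma}{1-\gamma}\right)}{\gamma^{f} + \gamma^{c}}
$$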

Figures (6)

  • Figure 1: The principle behind Dalap. The neighbors of $\mu$ are assigned probability mass between $\gamma$ and $1$ by an exponential function of their distance to $\mu$. The rest of the distribution decays geometrically from these two values.
  • Figure 2: The value of the penalty term $\log (\gamma^f + \gamma^c)$ for different values of $\gamma$. As the value of $\gamma$ goes to $0$, non-integer values of $\mu$ are penalized more heavily.
  • Figure 3: The expected value of unbounded Dalap. We take two numbers at a distance of $\frac{\gamma}{1-\gamma}$ below and above $\lfloor\mu\rfloor$ and $\lceil\mu\rceil$ respectively. The expected value is a weighted mean of these with weights proportional to $\gamma^{f}$ and $\gamma^{c}$.
  • Figure 4: Seeded sampling results from mixture models (K=10). The top portion of each image is provided as conditioning, and the model generates the remainder. Dalap produces high-quality completions across all datasets, with performance comparable to Dlogistic (PixelCNN++) with lower log-likelihood. Bitwise exhibits visible artifacts, particularly on CIFAR10, consistent with its poor quantitative performance.
  • Figure 5: Random (unconditional) sampling results on MNIST and FashionMNIST from mixture models (K=10). All images are generated from scratch without any conditioning.
  • ...and 1 more figure
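The captions of Figures 1 to 3 together suggest a concrete form for unbounded Dalap, sketched below. The sketch assumes that the unnormalized mass of an integer $n$ is $\gamma^{|n-\mu|}$, that $f = \mu - \lfloor\mu\rfloor$ and $c = 1 - f$ (so the upper neighbor is taken as $\lfloor\mu\rfloor + 1$ even when $\mu$ is an integer), and that $0 < \gamma < 1$. These are inferences from the captions, not definitions quoted from the paper, and the function names are hypothetical.

```python
import math


def dalap_log_pmf(n, mu, gamma):
    """Log-probability of integer n under the assumed unbounded Dalap form.

    Assumption: mass is proportional to gamma ** |n - mu|, which gives the
    two neighbors of mu mass gamma**f and gamma**c and a geometric decay
    with ratio gamma beyond them.
    """
    f = mu - math.floor(mu)
    c = 1.0 - f
    # Normalizer: (gamma**f + gamma**c) / (1 - gamma); its log contains
    # the penalty term log(gamma**f + gamma**c) from Figure 2.
    log_norm = math.log(gamma ** f + gamma ** c) - math.log1p(-gamma)
    return abs(n - mu) * math.log(gamma) - log_norm


def dalap_mean(mu, gamma):
    """Expected value as described in the Figure 3 caption."""
    lo = math.floor(mu)
    f = mu - lo
    c = 1.0 - f
    tail = gamma / (1.0 - gamma)
    w_lo, w_hi = gamma ** f, gamma ** c
    # Weighted mean of (lo - tail) and (lo + 1 + tail).
    return (w_lo * (lo - tail) + w_hi * (lo + 1 + tail)) / (w_lo + w_hi)
```

Under these assumptions, summing `n * exp(dalap_log_pmf(n, mu, gamma))` over a wide range of integers agrees numerically with `dalap_mean(mu, gamma)`, and the mean equals `mu` whenever `mu` is an integer.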

Theorems & Definitions (16)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Proposition 4
  • proof
  • Proposition 5
  • proof
  • ...and 6 more