Poisson Log-Normal Process for Count Data Prediction

Anushka Saha, Abhijith Gandrakota, Alexandre V. Morozov

TL;DR

PoLoN models non-negative count data with quantified uncertainty by placing a Gaussian process prior on the Poisson log-rates $\lambda(\vec{X})$, so that the rates are $\alpha(\vec{X}) = e^{\lambda(\vec{X})}$ and the predictive distribution is Poisson-LogNormal. An extension, PoLoN-SB, explicitly models localized signals on top of a smoothly varying background. The approach is validated on synthetic 1D/2D data, a bike rental dataset, and open Higgs data from the LHC, obtaining statistically meaningful signal extractions. Inference is made tractable with Laplace approximations, and kernel hyperparameters are optimized with L-BFGS-B, giving a non-parametric alternative for rate prediction, interpolation, and signal extraction. Overall, PoLoN broadens the applicability of Gaussian processes to count-based scientific data, offering de-noising, rate reconstruction, and principled signal detection in discrete observations.

Abstract

Modeling count data is important in physics and other scientific disciplines, where measurements often involve discrete, non-negative quantities such as photon or neutrino detection events. Traditional parametric approaches can be trained to generate integer-count predictions but may struggle with capturing complex, non-linear dependencies often observed in the data. Gaussian process (GP) regression provides a robust non-parametric alternative to modeling continuous data; however, it cannot generate integer outputs. We propose the Poisson Log-Normal (PoLoN) process, a framework that employs GP to model Poisson log-rates. As in GP regression, our approach relies on the correlations between data points captured via GP kernel structure rather than explicit functional parameterizations. We demonstrate that the PoLoN predictive distribution is Poisson-LogNormal and provide an algorithm for optimizing kernel hyperparameters. Furthermore, we adapt the PoLoN approach to the problem of detecting weak localized signals superimposed on a smoothly varying background - a task of considerable interest in many areas of science and engineering. Our framework allows us to predict the strength, location and width of the detected signals. We evaluate PoLoN's performance using both synthetic and real-world datasets, including the open dataset from CERN which was used to detect the Higgs boson at the Large Hadron Collider. Our results indicate that the PoLoN process can be used as a non-parametric alternative for analyzing, predicting, and extracting signals from integer-valued data.
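To make the Poisson-LogNormal (PLN) claim concrete, here is a minimal sketch, not the authors' implementation. It assumes that the log-rate at a single input is Gaussian with mean `mu` and standard deviation `s` (both hypothetical values chosen for illustration), draws counts by first sampling the log-rate and then a Poisson count, and compares the sample moments with the standard closed-form PLN mean and variance. The helper names are made up for this example.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson count with Knuth's multiplication method.

    Suitable for modest rates only, since exp(-rate) must not underflow.
    """
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_pln(mu, s, size, seed=0):
    """Draw PLN counts: log-rate ~ Normal(mu, s^2), count ~ Poisson(exp(log-rate))."""
    rng = random.Random(seed)
    return [sample_poisson(math.exp(rng.gauss(mu, s)), rng) for _ in range(size)]


def pln_mean(mu, s):
    """Closed-form PLN mean: E[n] = exp(mu + s^2 / 2)."""
    return math.exp(mu + 0.5 * s * s)


def pln_variance(mu, s):
    """Closed-form PLN variance: Var[n] = E[n] + (exp(s^2) - 1) * E[n]^2."""
    m = pln_mean(mu, s)
    return m + (math.exp(s * s) - 1.0) * m * m


# Illustrative (made-up) log-rate parameters at one input point.
mu, s = 1.0, 0.5
counts = sample_pln(mu, s, size=20000)
sample_mean = sum(counts) / len(counts)
sample_var = sum((c - sample_mean) ** 2 for c in counts) / (len(counts) - 1)

print("sample mean:", sample_mean, "analytic mean:", pln_mean(mu, s))
print("sample var: ", sample_var, "analytic var: ", pln_variance(mu, s))
```

The variance exceeds the mean whenever `s > 0`, which is the overdispersion that a plain Poisson model with a fixed rate cannot capture.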

Paper Structure (9 sections, 47 equations, 10 figures)

Figures (10)

  • Figure 1: PoLoN predictions for 1D integer-count data. Panel (a) shows exact Poisson rates $\alpha(x)$ that follow a linear trend with superimposed oscillations (red curve). Training set datapoints are shown as yellow dots. Blue curve: mean of the predictive PLN distribution; light blue shaded area: 95% confidence interval (CI). Panel (b) shows the training set datapoints (yellow dots) and the modes of the predictive PLN distributions for $W=500$ values of $x$ equally spaced in the training dataset range. The modes represent maximum posterior probability (MAP) predictions. Panels (c-d): same as (a-b), but for an exponentially decaying $\alpha(x)$.
  • Figure 2: Kernel hyperparameter optimization. Contour-overlaid heatmap of the posterior log-likelihood (Eq. \ref{L:marginal:Laplace}) for the training dataset generated by $\alpha(x)$, which consists of a linear trend with superimposed oscillations (Fig. \ref{fig:lin_sin_2}a). The optimal (Maximum Likelihood Estimation, or MLE) hyperparameters $\sigma^\star = 6.999$ and $\gamma^\star = 11.880$, found using the L-BFGS-B algorithm, are marked with a red dot.
  • Figure 3: PoLoN predictions for 2D integer-count data. Panel (a) shows the two-dimensional Poisson rate function $\alpha(x, y)$ with a linear trend in the $x$-direction and oscillations in the $y$-direction, for $x,y \in [1, 30]$. The training datapoints are shown in red. Panel (b) shows a comparison between predicted and actual Poisson rates for $225$ $(x,y)$ pairs in the test set. The solid red line is the "$x=y$" line; the dashed green line is the least-squares fit. The coefficient of determination is $R^2 = 0.982$.
  • Figure 4: Normalized RMSE as a function of $N_p$, the number of datapoints per input feature $x$ in the training dataset. Panel (a) shows $\epsilon$ for PoLoN predictions (Section \ref{sec:methode_1}). Panel (b) shows $\epsilon$ for PoLoN-SB predictions (Section \ref{sec:methode_2}). For each combination of $N_p$ and $S$, the normalized RMSE $\epsilon$ in Eq. \ref{rmse:2s} was averaged over $10$ independently generated training datasets.
  • Figure 5: Relative percentage error in signal parameter predictions. Panel (a) shows the error in the predicted signal strength $S$, panel (b) shows the error in the predicted mean of the signal $q$, and panel (c) shows the error in the predicted standard deviation of the signal $u$. We used PoLoN-SB (Section \ref{sec:methode_2}) to predict the signal parameters. For each combination of $N_p$ and $S$, the relative percentage error (computed as $\frac{|\text{predicted} - \text{true}|}{\text{true}} \times 100\%$) was averaged over $10$ independently generated training datasets.
  • ...and 5 more figures
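The relative percentage error quoted in the Figure 5 caption, and its averaging over independently generated training datasets, can be sketched as follows. This is an illustrative sketch only: the function names and the numbers in the usage lines are hypothetical and are not taken from the paper, and the true parameter value is assumed to be positive.

```python
def relative_percentage_error(predicted, true):
    """|predicted - true| / true * 100, as in the Figure 5 caption (assumes true > 0)."""
    if true <= 0:
        raise ValueError("true value must be positive for this error measure")
    return abs(predicted - true) / true * 100.0


def mean_relative_percentage_error(predictions, true):
    """Average the relative percentage error over several independent fits."""
    if not predictions:
        raise ValueError("need at least one prediction")
    errors = [relative_percentage_error(p, true) for p in predictions]
    return sum(errors) / len(errors)


# Made-up example: one true signal strength and ten hypothetical predictions,
# standing in for fits to ten independently generated training datasets.
true_strength = 50.0
predicted_strengths = [48.0, 52.5, 49.0, 51.0, 47.5, 50.5, 53.0, 49.5, 50.0, 46.0]
avg_error = mean_relative_percentage_error(predicted_strengths, true_strength)
print(f"mean relative error: {avg_error:.2f}%")
```

Because the error is averaged after taking absolute values, over- and under-estimates do not cancel, so the average reflects typical deviation rather than bias.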