A Hitchhiker's Guide to Poisson Gradient Estimation

Michael Ibrahim; Hanqi Zhao; Eli Sennesh; Zhi Li; Anqi Wu; Jacob L. Yates; Chengrui Li; Hadi Vafaii

TL;DR

This work tackles the challenge of differentiating through Poisson-distributed latent variables, a common scenario in neuroscience-inspired models. It compares the Exponential Arrival Time (Eat) and Gumbel-Softmax (Gsm) relaxations and introduces Eat$_{\textsf{cubic}}$, a cubic Hermite-based relaxation with compact support that yields an unbiased first moment for $\tau \le 1$ and improved distributional fidelity. Leveraging Campbell's theorem, the authors derive closed-form expressions for the Eat moments and show, through theory and experiments, that Eat$_{\textsf{cubic}}$ better preserves Poisson statistics (especially the mean and variance) while remaining robust to the choice of temperature; its downstream ELBO performance on Poisson VAE and POGLM tasks is often comparable to that of exact gradients. They also provide a gradient analysis showing that distributional fidelity and gradient quality capture complementary facets of relaxation performance, and they conclude with practical recommendations on choosing and tuning Poisson relaxations. Overall, the results highlight distributional fidelity as a crucial factor and offer a robust, temperature-insensitive option for Poisson gradient estimation with broad NeuroAI applicability.

Abstract

Poisson-distributed latent variable models are widely used in computational neuroscience, but differentiating through discrete stochastic samples remains challenging. Two approaches address this: Exponential Arrival Time (EAT) simulation and Gumbel-SoftMax (GSM) relaxation. We provide the first systematic comparison of these methods, along with practical guidance for practitioners. Our main technical contribution is a modification to the EAT method that theoretically guarantees an unbiased first moment (exactly matching the firing rate), and reduces second-moment bias. We evaluate these methods on their distributional fidelity, gradient quality, and performance on two tasks: (1) variational autoencoders with Poisson latents, and (2) partially observable generalized linear models, where latent neural connectivity must be inferred from observed spike trains. Across all metrics, our modified EAT method exhibits better overall performance (often comparable to exact gradients), and substantially higher robustness to hyperparameter choices. Together, our results clarify the trade-offs between these methods and offer concrete recommendations for practitioners working with Poisson latent variable models.

Paper Structure (122 sections, 96 equations, 15 figures, 2 tables, 3 algorithms)

Introduction
Our contributions.
Background and related work
Probabilistic latent variable models.
Variational inference and ELBO.
Optimizing ELBO: score-based vs. pathwise.
The Exponential Arrival Time (Eat) relaxation.
The Gumbel-Softmax (Gsm) relaxation.
The difficulty of auto-tuning temperature.
Results
Improving the distributional fidelity of Eat via cubic Hermite interpolation
Point processes and Campbell's theorem.
Closed-form expressions for the Eat moments.
Eat$_{\textsf{sigmoid}}$ exhibits substantial bias.
The Eat$_{\textsf{cubic}}$ approximation is unbiased.
...and 107 more sections

Figures (15)

Figure 1: Empirical validation of moment biases across relaxation methods. We compare Eat$_{\textsf{sigmoid}}$ (blue), Eat$_{\textsf{cubic}}$ (red), and Gsm (green) across temperatures $\tau \in [0.02, 0.5]$ and firing rates $\lambda \in \{2, 20, 100\}$ (columns). Top row: Ratio of empirical mean to true Poisson mean ($\lambda$). Bottom row: Ratio of empirical variance to true Poisson variance ($\lambda$). Dashed lines indicate ideal Poisson fidelity (ratio $= 1$). Shaded regions show $\pm 1$ standard error over 100 trials. Eat$_{\textsf{cubic}}$ (red) maintains near-perfect mean fidelity across all conditions and substantially better variance fidelity than both alternatives, confirming our theoretical predictions. Eat$_{\textsf{sigmoid}}$ (blue) exhibits severe bias, particularly at higher rates, with variance collapsing to below 20% of the true value at $\lambda = 100$, $\tau = 0.5$. See \ref{fig:dist_moments_all_rates} for results across all tested $\lambda$.
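The ratios plotted in Figure 1 can be illustrated with a short sketch. It is a minimal example using only the Python standard library; the helper names `moment_ratios` and `standard_error` are hypothetical, and the use of the unbiased (n - 1) sample variance is an assumption, since the estimator used in the paper is not stated here.

```python
import statistics


def moment_ratios(samples, rate):
    """Return (mean / rate, variance / rate) for samples meant to follow Poisson(rate).

    A true Poisson variable has mean and variance both equal to `rate`,
    so both ratios are ideally 1.
    """
    mean_ratio = statistics.fmean(samples) / rate
    # Assumption: unbiased sample variance (divides by n - 1).
    var_ratio = statistics.variance(samples) / rate
    return mean_ratio, var_ratio


def standard_error(values):
    """Standard error of the mean of per-trial values (e.g. one ratio per trial)."""
    return statistics.stdev(values) / len(values) ** 0.5


# Toy data chosen by hand, not drawn from any relaxation.
mean_ratio, var_ratio = moment_ratios([1, 2, 3, 4], rate=2.5)
print(mean_ratio, var_ratio)
```

In this toy case the mean ratio is 1 while the variance ratio is below 1, which is the kind of under-dispersion the bottom row of the figure is designed to expose.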
Figure 2: Wasserstein-1 distance from true Poisson across relaxation methods. We compare Eat$_{\textsf{sigmoid}}$ (blue), Eat$_{\textsf{cubic}}$ (red), and Gsm (green) across temperatures $\tau \in [0.02, 0.5]$ and firing rates $\lambda \in \{2, 20, 100\}$ (columns). Lower values indicate better distributional fidelity; the dashed line at zero represents a perfect match. Eat$_{\textsf{cubic}}$ consistently achieves the lowest $W_1$ distance across all conditions, with the gap widening substantially at higher rates. At $\lambda = 100$, $\tau = 0.5$, Eat$_{\textsf{cubic}}$ achieves 7$\times$ lower distance than Eat$_{\textsf{sigmoid}}$. See \ref{fig:dist_wasser_all_rates} for results across all tested $\lambda$.
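For readers unfamiliar with the metric in Figure 2, the following sketch computes the empirical Wasserstein-1 distance between two one-dimensional samples of equal size. In that case $W_1$ reduces to the mean absolute difference between the sorted samples. The function name is hypothetical, and how the paper estimates $W_1$ against the true Poisson distribution is not specified here, so treat this only as an illustration of the quantity.

```python
def wasserstein1_equal_size(xs, ys):
    """Empirical W1 distance between two 1-D samples with the same number of points.

    For equal-size samples, the optimal transport plan pairs the i-th smallest
    value of `xs` with the i-th smallest value of `ys`.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    paired = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in paired) / len(xs)


# Relaxed samples are real-valued; reference Poisson samples are integer counts.
relaxed = [0.2, 1.1, 2.3]
reference = [0, 1, 2]
print(wasserstein1_equal_size(relaxed, reference))
```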
Figure 3: Gradient quality analysis across relaxation methods. We compare gradient estimates from Eat$_{\textsf{sigmoid}}$ (blue), Eat$_{\textsf{cubic}}$ (red), Gsm (green), and score function with baseline (gray) across temperatures $\tau \in [0.02, 0.5]$ at firing rate $\lambda = 20$. (a) $\mathrm{CosMean}$ (\ref{eq:cos-mean}): cosine similarity between expected gradient and ground truth. (b) $\mathrm{CosSample}$ (\ref{eq:cos-sample}): average cosine similarity of individual gradient samples. (c) $\mathrm{BiasEnergy}$ (\ref{eq:bias-energy-main}): curvature-weighted squared bias. (d) $\mathrm{NoiseEnergy}$ (\ref{eq:noise-energy-main}): curvature-weighted noise variance. Both energy metrics are normalized by $\bm{g}^{*\top} \mathbf{H} \bm{g}^*$, providing an interpretable scale: values $\lesssim 0.1$ are negligible, while values $\gtrsim 1$ indicate errors dominate the true gradient. Both Eat methods achieve near-perfect directional alignment and low bias ($\sim\!10^{-2}$) and noise ($\sim\!10^{0}$) across all temperatures (but Eat$_{\textsf{sigmoid}}$ bias degrades as temperature increases). Gsm shows elevated $\mathrm{BiasEnergy}$ at low temperatures and higher $\mathrm{NoiseEnergy}$ throughout. The score function baseline performs poorly across all metrics. Shaded regions indicate $\pm 1$ standard deviation across $N=100$ Monte Carlo samples. See \ref{fig:grad_analysis_all_rates} for results across other rates and \ref{fig:grad_raw_bias_var} for raw bias and variance.
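The difference between the two cosine metrics in panels (a) and (b) can be made concrete with a small sketch. It follows only the verbal definitions in the caption: CosMean averages the gradient samples first and then compares with the ground truth, whereas CosSample compares each sample and then averages. The function names and the toy vectors are assumptions for illustration; the paper's exact equations are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cos_mean(grad_samples, true_grad):
    """Cosine similarity between the averaged gradient estimate and the true gradient."""
    n = len(grad_samples)
    mean_grad = [sum(g[i] for g in grad_samples) / n for i in range(len(true_grad))]
    return cosine(mean_grad, true_grad)


def cos_sample(grad_samples, true_grad):
    """Average cosine similarity of the individual gradient samples."""
    return sum(cosine(g, true_grad) for g in grad_samples) / len(grad_samples)


# Toy case: two noisy samples whose noise cancels on average.
true_grad = [1.0, 0.0]
grad_samples = [[1.0, 1.0], [1.0, -1.0]]
print(cos_mean(grad_samples, true_grad), cos_sample(grad_samples, true_grad))
```

In this toy case the averaged estimate points exactly along the true gradient, so CosMean is 1, while each individual sample is 45 degrees off, so CosSample is about 0.707. An estimator can therefore be well aligned in expectation yet noisy per sample, which is why the figure reports both.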
Figure 4: Validation ELBO across relaxation methods and temperatures. Top: Linear $\mathcal{P}$-VAE trained on whitened natural image patches. Bottom: POGLM on synthetic neural data. Left columns: Score function baseline across MC sample sizes. Right columns: Pathwise methods across temperatures $\tau \in \{0.02, 0.05, 0.1, 0.2, 0.5\}$. For $\mathcal{P}$-VAE, the black marker indicates exact gradients (closed-form ELBO for linear decoders), representing an upper bound. Eat$_{\textsf{cubic}}$ (red) achieves near-optimal performance consistently across all temperatures, while Eat$_{\textsf{sigmoid}}$ (blue) and Gsm (green) degrade substantially at $\tau = 0.5$. The score function baseline (gray) exhibits high variance and poor performance despite variance reduction, although its performance improves as the number of Monte Carlo samples increases. Error bars indicate $\pm 1$ std across three random seeds; open circles show individual runs. See \ref{fig:decoder_weights} for the learned representations, and \ref{fig:poglm_metrics_all} for POGLM results across a denser sampling of $\tau \in [0.02, 0.5]$, including weight recovery results.
Figure 5: Schematic of the Eat sampling process. Given rate $\lambda$, independent inter-arrival times $\Delta t_i \sim \operatorname{Exp}(\lambda)$ are cumulatively summed to determine absolute arrival times (right). The Poisson sample $z$ is defined as the count of events arriving before the unit time horizon (vertical dashed line). In this example, the first three events fall within the window, while the fourth exceeds it, resulting in a generated sample of $z=3$. This process visualizes the steps detailed in \ref{algo:eat:rsample}.
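The sampling process in Figure 5 can be sketched in plain Python. This is a minimal, non-differentiable illustration using only the standard library; the function name `eat_poisson_sample` and the `horizon` parameter are assumptions made for this sketch and are not taken from the paper's algorithm listing. The relaxations compared in the paper replace the hard count with a smooth function of the arrival times, which this sketch does not attempt.

```python
import random


def eat_poisson_sample(rate, rng, horizon=1.0):
    """Draw one Poisson(rate * horizon) sample by counting exponential arrivals.

    Inter-arrival times are i.i.d. Exp(rate); their running sum gives the
    absolute arrival times, and the sample is the number of arrivals that
    land before `horizon`.
    """
    if rate <= 0:
        return 0  # no events can arrive at zero rate
    count = 0
    arrival_time = rng.expovariate(rate)  # time of the first event
    while arrival_time < horizon:
        count += 1
        arrival_time += rng.expovariate(rate)  # add the next inter-arrival time
    return count


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [eat_poisson_sample(20.0, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)
print(f"mean={mean:.2f}, variance={var:.2f}")  # both should be close to the rate, 20
```

Because the count changes in integer jumps as arrival times cross the horizon, this hard version gives no useful pathwise gradient with respect to the rate, which is the motivation for the smooth relaxations studied in the paper.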
...and 10 more figures
