The Ray Tracing Sampler: Bayesian Sampling of Neural Networks for Everyone

Peter Behroozi

TL;DR

This work introduces the Ray Tracing Sampler, a Bayesian MCMC method that propagates parameter-space rays through a medium with refractive index $n(x)=\mathcal{L}(x)^{1/(D-1)}$, yielding constant-speed trajectories whose radiance tracks the likelihood. By conserving radiance and étendue, the method achieves fair sampling even with imperfect integrators and across likelihood barriers, while remaining highly resilient to stochastic gradients. The framework unifies prior methods (HMC, Langevin, Gibbs, Metropolis, etc.) as special cases under generalized ray tracing, and demonstrates scalable posterior sampling for neural networks from thousands to 1.5 billion parameters, including GPT-2-scale models on consumer hardware. Empirically, ray tracing matches HMC in low-noise settings but outperforms it under stochastic gradients, enabling practical Bayesian uncertainty quantification for large-scale neural networks with notable implications for model reliability and architecture design.

Abstract

We derive a Markov Chain Monte Carlo sampler based on following ray paths in a medium where the refractive index $n(x)$ is a function of the desired likelihood $\mathcal{L}(x)$. The sampling method propagates rays at constant speed through parameter space, which gives orders of magnitude higher resilience to heating from stochastic gradients than Hamiltonian Monte Carlo (HMC), as well as the ability to cross any likelihood barrier, including holes in parameter space. Using the ray tracing method, we sample the posterior distributions of neural network outputs for a variety of architectures, up to the 1.5 billion-parameter GPT-2 (Generative Pre-trained Transformer 2) architecture, all on a single consumer-level GPU. We also show that prior samplers, including traditional HMC, microcanonical HMC, Metropolis, Gibbs, and even Monte Carlo integration, are special cases within a generalized ray tracing framework, which can sample according to an arbitrary weighting function. Public code and documentation for C, JAX, and PyTorch are available at https://bitbucket.org/pbehroozi/ray-tracing-sampler/src
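
As a rough illustration of the relation $n(x)=\mathcal{L}(x)^{1/(D-1)}$ stated in the TL;DR, the sketch below evaluates the refractive index from a log-likelihood. It is not taken from the public code: the function names are hypothetical, the Gaussian log-likelihood is only a stand-in target, and working in log space is an assumption made here to avoid underflow when $D$ is large.

```python
import math


def log_refractive_index(log_likelihood, dim):
    """Return ln n(x) = ln L(x) / (D - 1) for a D-dimensional parameter space."""
    if dim < 2:
        raise ValueError("n = L**(1/(D-1)) requires D >= 2")
    return log_likelihood / (dim - 1)


def gaussian_log_likelihood(x):
    """Illustrative target only: ln L = -0.5 * |x|^2."""
    return -0.5 * sum(xi * xi for xi in x)


# In two dimensions the exponent 1/(D-1) is 1, so n equals L.
x2 = [0.3, -1.2]
ln_l = gaussian_log_likelihood(x2)
n_2d = math.exp(log_refractive_index(ln_l, dim=2))

# In high dimensions ln n is the log-likelihood divided by D - 1,
# so n varies far more slowly across parameter space than L does.
x_big = [0.01] * 10_000
ln_n_big = log_refractive_index(gaussian_log_likelihood(x_big), dim=len(x_big))
```

For the two-dimensional point, `n_2d` equals the likelihood itself, while for the 10,000-dimensional point `ln_n_big` is the log-likelihood scaled down by a factor of 9,999.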

Paper Structure

This paper contains 47 sections, 52 equations, 13 figures, 1 table, and 1 algorithm.

Figures (13)

  • Figure 1: Example sampling dynamics for ray tracing and Hamiltonian Monte Carlo (HMC) for a 2D Gaussian mixture, with $\mathcal{L}(\mathbf{x})=[\exp(-0.5(x_1-2)^2)+\exp(-0.5(x_1+2)^2)]\exp(-0.5x_2^2)$. For this case, as well as any other 2D distribution, fair sampling with ray tracing occurs when the spatially-varying refractive index is the same as the likelihood function (i.e., $n(\mathbf{x})=\mathcal{L}(\mathbf{x})$). This figure illustrates the bending of light toward higher-probability regions due to Snell's law for a ray bundle originating from the white dot on the left of the figure. Notably, the rays can explore a wide range of likelihood values. In contrast, particle orbits starting from the same initial conditions using HMC are limited by the initial kinetic energy, chosen for this illustration to be the mean kinetic energy. The background color scale corresponds to the underlying probability distribution function, scaled by its maximum value.
  • Figure 2: Refraction of light for an arbitrary dimensionality $D$. This figure shows an incoming beam of light with radiance $L_1$ in a medium with refractive index $n_1$, crossing a differential area $dA$ at the interface with another medium with refractive index $n_2$, resulting in an outgoing beam of light with radiance $L_2$. By Snell's law, $n_1 \sin \theta_1 = n_2 \sin \theta_2$, where $\theta_1$ and $\theta_2$ are the respective angles to the interface normal vector (in green). If $n_2$ is larger than $n_1$, the outgoing beam of light is compressed not only in the $\hat{\theta}_2$ direction (by Snell's law), but also along the directions perpendicular to $\hat{\theta}_2$, similar to how lines of longitude are compressed as one moves closer to the Earth's poles. In addition, the outgoing beam becomes closer to the interface normal vector, increasing the projected area ($\cos\theta_2\, dA$) from which it appears to originate. These two effects combine to result in an outgoing radiance $L_2$ that is a factor $(n_2/n_1)^{D-1}$ larger than the incoming radiance $L_1$ (Eq. \ref{e:radiance2}).
  • Figure 3: Parameters describing rays propagating from a differential emission area $dA_E$ to a differential receiving area $dA_R$. The angles between the rays and the normal vectors of the emitter and receiver are $\theta_E$ and $\theta_R$, respectively. The differential solid angle encompassed by the rays at the emitter is $d\Omega_E$, and that encompassed by the rays at the receiver is $d\Omega_R$. As discussed in the text, the étendue at the emitter ($d^2G_E \equiv n_E^{D-1} dA_E \cos\theta_E d\Omega_E$) is equal to the étendue at the receiver ($d^2G_R\equiv n_R^{D-1} dA_R \cos\theta_R d\Omega_R$).
  • Figure 4: Path dynamics of Hamiltonian Monte Carlo. A particle moving from potential $U_1$ to $U_2$ experiences a force normal to the potential boundary, causing it to accelerate. The velocity component perpendicular to the force, $\mathbf{v}_\perp$, is unaffected. Hence, we have the geometrical identity $|\mathbf{v}_1| \sin\theta_1 = |\mathbf{v}_\perp| = |\mathbf{v}_2| \sin\theta_2$, and the path is identical to that of a light ray traveling in a medium with refractive index $n = |\mathbf{v}|$. Since the magnitude of $\mathbf{v}$ is only a function of the potential energy for a given choice of total energy $E$, this results in a well-defined solution for $n$ for all particle paths at fixed total energy.
  • Figure 5: This figure shows ray tracing samples for a 10,000-dimensional Gaussian likelihood ($\ln \mathcal{L}=-0.5|\mathbf{x}|^2$), using a moderate step size equivalent to a change in propagation direction of $\langle \Delta\phi\rangle =0.25$ at each step. The non-Metropolis cases both converge to stable distributions that are biased, so a Metropolis test can be helpful to achieve larger step sizes when exact likelihoods are used.
  • ...and 8 more figures
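
The geometric relations in the captions of Figures 2 and 4 can be checked numerically. The sketch below is illustrative only and is not the paper's implementation: the function names are hypothetical, the interface is assumed to be flat between two uniform regions, and unit particle mass is assumed for the HMC case so that energy conservation reads $\tfrac{1}{2}|\mathbf{v}|^2 + U = E$.

```python
import math


def refract_angle(n1, n2, theta1):
    """Snell's law, n1 sin(theta1) = n2 sin(theta2); None means no refracted ray."""
    s = n1 * math.sin(theta1) / n2
    if abs(s) > 1.0:
        return None  # total internal reflection
    return math.asin(s)


def radiance_ratio(n1, n2, dim):
    """Factor L2 / L1 = (n2 / n1)**(D - 1) from the Figure 2 caption."""
    return (n2 / n1) ** (dim - 1)


def hmc_cross_boundary(v_normal, v_tangent, u1, u2):
    """Velocity after a unit-mass particle steps from potential u1 to u2.

    The force acts along the boundary normal, so only the normal
    component changes; the tangential component is preserved.
    Returns None if the particle lacks the energy to cross.
    """
    vn_sq = v_normal ** 2 + 2.0 * (u1 - u2)
    if vn_sq < 0.0:
        return None
    return math.sqrt(vn_sq), v_tangent


# A particle accelerates while moving to a lower potential (u2 < u1).
vn1, vt1 = 1.0, 0.5
vn2, vt2 = hmc_cross_boundary(vn1, vt1, u1=1.0, u2=0.0)

speed1 = math.hypot(vn1, vt1)
speed2 = math.hypot(vn2, vt2)
theta1 = math.atan2(vt1, vn1)

# Angle from the particle dynamics versus Snell's law with n = |v|.
theta2_direct = math.atan2(vt2, vn2)
theta2_snell = refract_angle(speed1, speed2, theta1)
```

Under these assumptions, `theta2_direct` and `theta2_snell` agree to floating-point precision, which is the identity $|\mathbf{v}_1| \sin\theta_1 = |\mathbf{v}_2| \sin\theta_2$ from the Figure 4 caption.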