On the accuracy of posterior recovery with neural network emulators

H. T. J. Bevins, T. Gessey-Jones, W. J. Handley

TL;DR

This work provides a theoretically grounded bound on information loss when using neural-network emulators in Bayesian inference for cosmological models. By deriving and specializing a KL-divergence bound under Gaussian likelihood and (approximately) linear models, the authors quantify how emulator RMSE relative to data noise controls posterior distortion. They demonstrate the approach in a 21-cm cosmology setting by directly comparing ARES with a globalemu emulator, showing accurate posterior recovery even when RMSE is around 20% of the noise, and reconciling prior concerns about emulator use. The results offer a practical criterion for emulator accuracy and reinforce confidence in emulators as scalable tools for inference in computationally expensive cosmological simulations.

Abstract

Neural network emulators are widely used in astrophysics and cosmology to approximate complex simulations inside Bayesian inference loops. Ad hoc rules of thumb are often used to justify the emulator accuracy required for reliable posterior recovery. We provide a theoretically motivated limit on the maximum amount of incorrect information inferred by using an emulator with a given accuracy. Under assumptions of linearity in the model, uncorrelated noise in the data and a Gaussian likelihood function, we demonstrate that the difference between the true underlying posterior and the recovered posterior can be quantified via a Kullback-Leibler divergence. We demonstrate how this limit can be used in the field of 21-cm cosmology by comparing the posteriors recovered when fitting mock data sets generated with the 1D radiative transfer code ARES directly with the simulation code and separately with an emulator. This paper is partly in response to and builds upon recent discussions in the literature which call into question the use of emulators in Bayesian inference pipelines. Upon repeating some aspects of these analyses, we find these concerns quantitatively unjustified, with accurate posterior recovery possible even when the mean RMSE of the emulator is approximately 20% of the magnitude of the noise in the data. For the purposes of community reproducibility, we make our analysis code public at https://github.com/htjb/validating_posteriors.
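The setting described in the abstract (linear model, uncorrelated Gaussian noise, Gaussian likelihood) can be illustrated with a small numerical sketch. Everything in it is a hypothetical toy setup rather than the authors' analysis code: the design matrix M, the Gaussian prior, the data size and the emulator error vector eps are all assumptions made for illustration. With a Gaussian prior, a fixed emulator error shifts the posterior mean but leaves the covariance unchanged, so the KL divergence has a closed form. In this toy setup a short calculation gives a ceiling of (N_d / 2)(RMSE / sigma)^2 nats, which is the quantity the code checks; whether this matches the paper's limit term by term is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem sizes and noise level (not taken from the paper).
n_data, n_params = 50, 3
sigma = 25.0          # standard deviation of the uncorrelated Gaussian noise
rmse_target = 5.0     # emulator RMSE, here 20% of sigma

# Hypothetical linear model: data = M @ theta + noise.
M = rng.normal(size=(n_data, n_params))
prior_cov = np.eye(n_params) * 10.0**2   # assumed zero-mean Gaussian prior

# Hypothetical emulator error, rescaled so its RMSE is exactly rmse_target.
eps = rng.normal(size=n_data)
eps *= rmse_target / np.sqrt(np.mean(eps**2))

# Posterior precision and covariance are identical for the true model and
# the emulator, because a constant offset does not change the curvature.
post_prec = np.linalg.inv(prior_cov) + M.T @ M / sigma**2
post_cov = np.linalg.inv(post_prec)

# The emulator offset acts like a shift of the data by -eps,
# which moves the posterior mean by this amount.
delta_mu = -post_cov @ M.T @ eps / sigma**2

# KL divergence between two Gaussians sharing a covariance (in nats).
kl_nats = 0.5 * delta_mu @ post_prec @ delta_mu

# Ceiling obtained in this toy setup: (N_d / 2) * (RMSE / sigma)^2.
bound_nats = 0.5 * n_data * (rmse_target / sigma) ** 2

kl_bits = kl_nats / np.log(2)
bound_bits = bound_nats / np.log(2)
print(f"KL = {kl_bits:.3f} bits, ceiling = {bound_bits:.3f} bits")
```

The ceiling holds here because the matrix that maps eps to the mean shift has eigenvalues no larger than one, so the KL divergence cannot exceed half the squared norm of eps divided by sigma^2. Only the part of the emulator error that projects onto the model's parameter directions moves the posterior, which is one reason a measured divergence can sit well below such a limit.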

Paper Structure

This paper contains 18 sections, 41 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: The figure shows the average and 95th percentile absolute difference between the ARES signals and emulated signals in the test data set as a function of redshift. We show the errors for two emulators, one with the preprocessing steps outlined in the original globalemu paper and another without these preprocessing steps, as was done in DJ23. The emulator error is larger at lower redshifts, where the variation in the signal across the test data is strongest. When we do not include the preprocessing steps, the performance of the emulator is worse, in disagreement with DJ23. The rough redshift ranges corresponding to the Dark Ages (grey), Cosmic Dawn (yellow), Epoch of Heating (orange) and Epoch of Reionization (red) are highlighted along with the set of parameters which have the most impact during each epoch. We note that in reality there is a lot of overlap between these different epochs and the processes that govern the signal. We might expect the recovery of constraints on $f_\mathrm{esc}$ and $\log N_\mathrm{HI}$ to be worse than the constraints on the star formation parameters when using the emulators because the error is larger over the EoR window compared to the CD.
  • Figure 2: The posteriors recovered when fitting the fiducial ARES signal directly with ARES in blue and globalemu in orange for $\sigma=25$ mK. The lower half of the triangle plot shows a kernel density estimation (KDE) of the 2D marginalised posteriors, the diagonal shows the 1D KDEs and the upper half shows the samples. We show the fiducial parameter values as red dashed lines for reference, but stress that we are more interested in the similarity between the posteriors in this work. While there are some small differences between the posteriors, they are visually similar. Given the mean RMSE for the emulator and the limit outlined in \ref{eq:limit-dkl}, this is not so surprising, with the maximum predicted $\mathcal{D}_\mathrm{KL}$ between the emulated and true posteriors being $0.38$ bits. The actual $\mathcal{D}_\mathrm{KL}$ estimated with margarine is $0.05_{-0.52}^{+4.02}$.
  • Figure 3: The graph shows how the upper limit value of $\mathcal{D}_\mathrm{KL}$, defined by \ref{eq:limit-dkl}, changes with the standard deviation of the Gaussian random noise in the data and the RMSE error on the emulator. The dashed red line and dashed green line show the contours corresponding to the mean and 95th percentile errors for the globalemu emulator used in this work. From the intersection between these lines and the dotted vertical lines at $\sigma=5, 25, 50$ and 250 mK one can put an approximate upper bound on the $\mathcal{D}_\mathrm{KL}$ between the posterior recovered when using ARES and the emulator. These upper bounds are reported in \ref{tab:dkl-values} with the bound from the 95th percentile being more conservative than from the mean RMSE across the test data. The purple scatter points show estimates of the KL divergence between the recovered posteriors for the three different noise levels. We see that even when $\sigma = 5$ mK the emulated posterior is very close to the true posterior recovered with ARES and that the upper limit defined in \ref{eq:limit-dkl} while not perfect provides a good gauge on the expected KL for a given emulator error. The KL values shown in purple are also reported in \ref{tab:dkl-values}. In this example, the units on $\sigma$ and RMSE are given in mK, but we stress that the discussion in this paper is applicable beyond 21-cm cosmology.
  • Figure 4: The figure shows the recovered posteriors when modelling the data directly with ARES in blue and with globalemu in orange for $\sigma=5$ mK. As with \ref{fig:25mk-results}, the two posteriors are similar: although the upper limit on $\mathcal{D}_\mathrm{KL}$ is $9.60$ bits, the calculated $\mathcal{D}_\mathrm{KL}$ is $0.25_{-0.25}^{+4.45}$.
  • Figure 5: The true posterior recovered with ARES in blue and the posterior recovered with the globalemu emulator in orange for $\sigma=50$ mK. As expected from \ref{eq:limit-dkl} and \ref{fig:kl-div}, the posterior distributions look even more alike than when the noise is 5 and 25 mK. The estimated limit is $\mathcal{D}_\mathrm{KL} \leq 0.10$ based on the mean emulator RMSE or $\mathcal{D}_\mathrm{KL} \leq 0.97$ based on the 95th percentile emulator RMSE. The calculated $\mathcal{D}_\mathrm{KL} = 0.09_{-0.03}^{+1.62}$.
  • ...and 3 more figures
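
The upper limits quoted in the captions of Figures 2, 4 and 5 can be cross-checked with a few lines of arithmetic. The sketch below assumes that, for a fixed emulator RMSE, the limit scales as the inverse square of the noise level, as one would expect if it depends on the ratio RMSE/$\sigma$ squared. It also assumes the three quoted limits share the same units and all refer to the mean emulator RMSE. Under those assumptions the product of the limit and $\sigma^2$ should be roughly constant, up to rounding in the quoted values.

```python
# Upper limits on D_KL quoted in the figure captions (mean emulator RMSE),
# keyed by the noise level sigma in mK.
reported_bounds = {5.0: 9.60, 25.0: 0.38, 50.0: 0.10}

# If the limit scales as 1 / sigma^2 at fixed RMSE (an assumption),
# then limit * sigma^2 should be the same for every noise level.
scaled = {s: b * s**2 for s, b in reported_bounds.items()}

mean_scaled = sum(scaled.values()) / len(scaled)
max_rel_dev = max(abs(v - mean_scaled) / mean_scaled for v in scaled.values())

for s, v in scaled.items():
    print(f"sigma = {s:5.1f} mK: limit * sigma^2 = {v:.1f}")
print(f"largest relative deviation from the mean: {max_rel_dev:.3f}")
```

The three products agree to within a few percent, which is consistent with the quoted limits being rounded to two decimal places.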