Table of Contents
Fetching ...

GetDist: a Python package for analysing Monte Carlo samples

Antony Lewis

TL;DR

GetDist provides a KDE-based framework for analyzing Monte Carlo samples from Bayesian inference, addressing weights, correlations, and prior boundaries. It implements boundary-corrected kernels, multiplicative bias correction, and ISJ-DCT bandwidth selection to deliver accurate density estimates and uncertainty quantification. The package supports fast 1D/2D density estimates, convergence diagnostics, and publication-quality plotting and tables, with both scriptable and GUI interfaces. Although motivated by cosmology, the methods are general and broadly applicable to low-dimensional marginalized densities.

Abstract

Monte Carlo techniques, including MCMC and other methods, are widely used in Bayesian inference to generate sets of samples from a parameter space of interest. The Python GetDist package provides tools for analysing these samples and calculating marginalized one- and two-dimensional densities using Kernel Density Estimation (KDE). Many Monte Carlo methods produce correlated and/or weighted samples, for example produced by MCMC, nested, or importance sampling, and there can be hard boundary priors. GetDist's baseline method consists of applying a linear boundary kernel, and then using multiplicative bias correction. The smoothing bandwidth is selected automatically following Botev et al., based on a mixture of heuristics and optimization results using the expected scaling with an effective number of samples (defined here to account for both MCMC correlations and weights). Two-dimensional KDE uses an automatically-determined elliptical Gaussian kernel for correlated distributions. The package includes tools for producing a variety of publication-quality figures using a simple named-parameter interface, as well as a graphical user interface that can be used for interactive exploration. It can also calculate convergence diagnostics, produce tables of limits, and output in LaTeX, and is publicly available.

GetDist: a Python package for analysing Monte Carlo samples

TL;DR

GetDist provides a KDE-based framework for analyzing Monte Carlo samples from Bayesian inference, addressing weights, correlations, and prior boundaries. It implements boundary-corrected kernels, multiplicative bias correction, and ISJ-DCT bandwidth selection to deliver accurate density estimates and uncertainty quantification. The package supports fast 1D/2D density estimates, convergence diagnostics, and publication-quality plotting and tables, with both scriptable and GUI interfaces. Although motivated by cosmology, the methods are general and broadly applicable to low-dimensional marginalized densities.

Abstract

Monte Carlo techniques, including MCMC and other methods, are widely used in Bayesian inference to generate sets of samples from a parameter space of interest. The Python GetDist package provides tools for analysing these samples and calculating marginalized one- and two-dimensional densities using Kernel Density Estimation (KDE). Many Monte Carlo methods produce correlated and/or weighted samples, for example produced by MCMC, nested, or importance sampling, and there can be hard boundary priors. GetDist's baseline method consists of applying a linear boundary kernel, and then using multiplicative bias correction. The smoothing bandwidth is selected automatically following Botev et al., based on a mixture of heuristics and optimization results using the expected scaling with an effective number of samples (defined here to account for both MCMC correlations and weights). Two-dimensional KDE uses an automatically-determined elliptical Gaussian kernel for correlated distributions. The package includes tools for producing a variety of publication-quality figures using a simple named-parameter interface, as well as a graphical user interface that can be used for interactive exploration. It can also calculate convergence diagnostics, produce tables of limits, and output in LaTeX, and is publicly available.

Paper Structure

This paper contains 12 sections, 45 equations, 5 figures.

Figures (5)

  • Figure 1: Example GetDist 'triangle' plot of MCMC parameter samples, here taken from the Planck 2018 baseline cosmological parameter chains PCP2018 (generated using the fast-parameter dragging sampler Lewis:2002ahNeal04Lewis:2013hha). Thinned samples are shown as coloured points, where the colour corresponds to the $\theta_*$ parameter shown in the colour bar (which is marginalized out by projection in the 1 and 2D plots). The 1D plots and 2D density contours containing 68% and 95% of the probability are constructed from all of the samples using kernel density estimates. Although relatively simple unimodal distributions, all the marginalized 2D distributions are somewhat non-Gaussian. There is a hard prior on the parameter $z_{\rm re} > 6.5$ which must be accounted for in the density estimates, and $H_0$ and $\Omega_{\rm m}$ are tightly correlated. In GetDist plots like this can be generated quickly with a single command using a list of input samples and a list of names of parameters to plot.
  • Figure 2: Samples from a truncated Gaussian distribution with $x>1$. The histogram is on the left and density estimates on the right using various different kernels (without multiplicative bias correction). Some form of boundary correction is essential in order not to severely underestimate the density near the boundary. The lowest-order correction removes the leading bias, but tends to underestimate any gradient at the boundary. In the case shown here the first-order correction works better than the zeroth order correction, but this is not guaranteed; higher-order methods can make the result less stable.
  • Figure 3: This figure highlights the importance of boundary correction when estimating densities with prior constraints. It shows 2D density contours for a sample distribution with a sharp linear prior constraint that is not aligned with axes (black dashed line). Without boundary correction (dashed contours), the estimated density incorrectly extends into the excluded region and the density is an artificially suppressed near the boundary due to smoothing. With boundary correction (filled contours), the density accurately reflects the true distribution and contours are correctly estimated right up to the constraint line, demonstrating the effectiveness of the boundary correction. Contours enclose 68% and 95% of the probability.
  • Figure 4: Left: a set of test Gaussian-mixture distributions, comparing the true distribution (red) with the density estimate using 10000 independent samples (blue) using multiplicative bias correction and a linear boundary kernel. The 'Gaussian' panels at the bottom are truncated Gaussian distributions, and all distributions are normalized by the maximum value. Right: scaling of the average integrated squared error $\langle\int {\rm d} x (\Delta f(x))^2\rangle/\int {\rm d} x (f(x))^2$ of the density estimate, where the average is estimated using 1000 sets of 10000 samples for each distribution. The $x$-axis is a scaling relative to the automatically chosen kernel width (e.g. by Eq. \ref{['h_MBC']}), so that one corresponds to the performance with default settings. Lines compare different kernel estimates: solid lines use a multiplicative bias correction (MBC) and linear boundary kernel (black: default, blue: next-order multiplicative bias correction). Dotted is the basic Parzen kernel (for which the kernel-width estimator is optimizing), dot-dashed is with linear boundary correction, and dashed is using a second-order boundary-corrected kernel. The MBC kernel width is suboptimally chosen for Gaussian, where the leading bias term happens to be zero, but about right in many other cases.
  • Figure 5: Left: A set of 2D Gaussian mixture distributions (WJ$x$ labels are from the same test distributions as Ref. Wand93), comparing true density contours (enclosing the 68% and 95% of the probability) with contours from density estimation using one set of 10000 samples. Right: normalized average integrated squared error $\langle\int {\rm d} x (\Delta f(x))^2\rangle/\int {\rm d} x (f(x))^2$ from 500 simulations of 10000 samples, as Fig. \ref{['tests1D']}. In all but two of the trimodal examples the black lines (default is scale width one, with multiplicative bias correction and linear boundary kernel) give substantially lower error than the basic Parzen estimator (red dotted) and are more stable than the higher-order bias correction (blue).