Improving the evaluation of samplers on multi-modal targets

Louis Grenioux, Maxence Noble, Marylou Gabrié

TL;DR

The paper addresses the challenge of evaluating samplers on multi-modal targets, where mode discovery and accurate mode-weight estimation are hard in high dimensions. It proposes a synthetic benchmark based on a bi-modal Gaussian mixture with controllable dimension and mode separation, and introduces a mode-weight metric to quantify sampler performance, enabling interpretable diagnostics across methods. Across local MCMC, importance sampling, variational inference, annealed methods, and diffusion-based approaches, the study finds that annealing-based samplers (SMC, Replica Exchange) robustly recover mode proportions in moderate settings and that diffusion-based methods (SLIPS, DDS) show promise with careful tuning. Vanilla MCMC, IS, and VI struggle as separation and dimension grow. The framework offers practical insights for diagnosing sampler strengths and guiding robust, scalable development for multi-modal sampling tasks.

Abstract

Addressing multi-modality constitutes one of the major challenges of sampling. In this reflection paper, we advocate for a more systematic evaluation of samplers with respect to two sources of difficulty: mode separation and dimension. To this end, we propose a synthetic experimental setting that we illustrate on a selection of samplers, focusing on the challenging criterion of recovering the relative importance of the modes. These evaluations are crucial to diagnose the potential of samplers to handle multi-modality and therefore to drive progress in the field.

Paper Structure

This paper contains 39 sections, 19 equations, 5 figures.

Figures (5)

  • Figure 1: Density of the bi-modal Gaussian mixture used in our experiments ($d = 2$) with increasing values of $a$. As expected, a larger value of $a$ moves the modes further apart, which increases the sampling difficulty.
  • Figure 2: Results on mode weight estimation for the bi-modal Gaussian mixture defined in \ref{subsec:def_target}, when varying hyperparameters $a$ and $d$. For each setting, we aggregate 48 Monte Carlo estimations, each one being computed over 8192 samples. Hatched areas indicate settings with systematic mode collapse in the sampling process. (Left): Averaged absolute error of the estimation with respect to the true mode weight $w\approx 66.7\%$. (Right): Standard deviation of the estimation.
  • Figure 3: Same as Figure 2 for the best-performing SMC, RE and SLIPS, extended to higher dimensions. For the sake of readability, the color scale differs from that of Figure 2 (a consistent color scale can be found in Figure 4 of the appendix, \ref{app:implem}). (Left): Averaged absolute error of the estimation with respect to the true mode weight $w\approx 66.7\%$. (Right): Standard deviation of the estimation.
  • Figure 4: Same as Figure 2 for the best-performing SMC, RE and SLIPS in higher dimensions. Note that the color scale is the same as in Figure 2. (Left): Averaged absolute error of the estimation with respect to the true mode weight $w\approx 66.7\%$. (Right): Standard deviation of the estimation.
  • Figure 5: Average wall-clock computing time of each sampling algorithm. The results were averaged over $48 \times 25$ runs ($48$ Monte Carlo estimations for each of the $25$ different mode separation/dimension settings).
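
The evaluation protocol described in the captions of Figures 2 to 4 (48 Monte Carlo estimations of the mode weight, each over 8192 samples, summarized by mean absolute error and standard deviation) can be sketched as follows. This is a minimal illustration, not the authors' implementation. Several details are assumptions that this page does not state: the modes are placed at $-a\mathbf{1}$ and $+a\mathbf{1}$ with identity covariance, the true weight is taken as exactly $2/3$, samples are assigned to the nearest mode, and the names `sample_mixture`, `estimate_mode_weight` and `evaluate` are hypothetical.

```python
import numpy as np


def sample_mixture(rng, n, d, a, w=2 / 3):
    """Exact samples from w * N(-a*1, I) + (1 - w) * N(+a*1, I).

    The mode locations and identity covariance are assumptions made
    for this sketch, not the paper's stated target.
    """
    in_first = rng.random(n) < w                      # mode membership
    means = np.where(in_first[:, None], -a, a) * np.ones((n, d))
    return means + rng.standard_normal((n, d))


def estimate_mode_weight(samples):
    """Fraction of samples assigned to the first mode (nearest-mean rule).

    With means -a*1 and +a*1 (a > 0) and equal isotropic covariances,
    a sample is closer to the first mode iff its coordinates sum to < 0.
    """
    return float(np.mean(samples.sum(axis=1) < 0.0))


def evaluate(sampler, a, d, w=2 / 3, n_repeats=48, n_samples=8192, seed=0):
    """Return (mean absolute error, standard deviation) of the estimates."""
    rng = np.random.default_rng(seed)
    estimates = np.array(
        [estimate_mode_weight(sampler(rng, n_samples, d, a)) for _ in range(n_repeats)]
    )
    return float(np.mean(np.abs(estimates - w))), float(np.std(estimates))


# Reference case: an exact sampler of the assumed target.
mae_exact, std_exact = evaluate(sample_mixture, a=3.0, d=8)

# Failure case: a sampler that only ever visits the first mode.
def collapsed_sampler(rng, n, d, a):
    return sample_mixture(rng, n, d, a, w=1.0)

mae_collapsed, std_collapsed = evaluate(collapsed_sampler, a=3.0, d=8)
```

Under these assumptions, the exact sampler gives an error driven only by Monte Carlo noise, while the collapsed sampler gives an error close to $1/3$ with near-zero standard deviation. A low standard deviation alone is therefore not evidence of a good sampler, which is presumably why the figures report both panels and mark mode collapse separately.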