One flow to correct them all: improving simulations in high-energy physics with a single normalising flow and a switch

Caio Cesar Daumann; Mauro Donega; Johannes Erdmann; Massimiliano Galli; Jan Lukas Späh; Davide Valsecchi

TL;DR

The paper addresses mismodelling in the Monte Carlo (MC) simulations used in high-energy physics by introducing a morphing method based on a single normalising flow conditioned on a boolean, IsData. The flow learns to map both data and simulation to a shared base distribution. A simulated sample is corrected by passing it through the flow with the simulation condition, flipping the boolean, and applying the inverse transformation, which morphs the sample into data space while preserving its quantiles. Validated on two-dimensional benchmarks and on a physics-inspired toy dataset with non-trivial correlations, the approach reaches agreement at the 1–2% level in the marginal distributions and substantially improves the correlation structure, leaving data and corrected simulation nearly indistinguishable to a boosted decision tree classifier. The method is simple to train, robust across the ancillary variables, and extendable to morphing between more than two domains, which makes it a broadly applicable tool for data-driven MC corrections in high-energy physics and related fields.
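The morphing step described above can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: the spline-based flow is replaced by a one-dimensional affine map with one (mean, width) pair per value of the boolean, fitted in closed form, and the names `ConditionalAffineFlow` and `morph` are hypothetical. Only the logic of "forward with the simulation condition, flip the switch, invert" is meant to carry over.

```python
import random
import statistics


class ConditionalAffineFlow:
    """Toy stand-in for a conditional flow: z = (x - mu[c]) / sigma[c].

    Assumption: a lookup table of (mu, sigma) per boolean condition
    replaces the neural network used in the paper.
    """

    def __init__(self):
        self.params = {}

    def fit(self, samples, is_data):
        # Closed-form maximum-likelihood fit for a standard-normal base.
        self.params[is_data] = (statistics.fmean(samples),
                                statistics.pstdev(samples))

    def forward(self, x, is_data):
        # Map samples from the chosen domain to the shared base distribution.
        mu, sigma = self.params[is_data]
        return [(xi - mu) / sigma for xi in x]

    def inverse(self, z, is_data):
        # Map base-distribution samples back to the chosen domain.
        mu, sigma = self.params[is_data]
        return [zi * sigma + mu for zi in z]


def morph(flow, x_sim):
    """Correct simulation: forward as simulation, invert as data."""
    z = flow.forward(x_sim, is_data=False)
    return flow.inverse(z, is_data=True)


rng = random.Random(0)
data = [rng.gauss(1.0, 2.0) for _ in range(20000)]  # toy "data"
sim = [rng.gauss(0.5, 1.5) for _ in range(20000)]   # toy mismodelled "simulation"

flow = ConditionalAffineFlow()
flow.fit(sim, is_data=False)
flow.fit(data, is_data=True)
corrected = morph(flow, sim)
```

In this toy setting the corrected sample reproduces the mean and width of the data sample by construction. A real flow would additionally learn the shape of each distribution, the correlations between the features, and the dependence on the conditioning variables.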

Abstract

Simulated events are key ingredients in almost all high-energy physics analyses. However, imperfections in the simulation can lead to sizeable differences between the observed data and simulated events. The effects of such mismodelling on relevant observables must be corrected either effectively via scale factors, with weights or by modifying the distributions of the observables and their correlations. We introduce a correction method that transforms one multidimensional distribution (simulation) into another one (data) using a simple architecture based on a single normalising flow with a boolean condition. We demonstrate the effectiveness of the method on a physics-inspired toy dataset with non-trivial mismodelling of several observables and their correlations.

Paper Structure (12 sections, 3 equations, 11 figures, 1 table)

Introduction
Correcting simulations with one normalising flow
Normalising flows for morphing distributions
Two-dimensional benchmarks
Generation of the physics-inspired dataset
Training and results on the physics-inspired dataset
Preprocessing and training
Evaluation of the corrections
Conclusions
Two-dimensional visualisation of the dataset
Details of the dataset generation
Morphing between three domains

Figures (11)

Figure 1: Top: Illustration of the single-flow morphing. The normalising flow is trained to map both data and simulation to the same base distribution. The flow is conditioned on a boolean that encodes whether the input is drawn from simulation or data. Bottom: Illustration of the preservation of quantiles during the morphing from simulation to data space using the base distribution as an intermediary.
Figure 2: The upper plots show the morphing from the checkerboard distribution into the four-circles distribution (left) and into the two-moons distribution (right). The lower plots illustrate the inverted transformation.
Figure 3: Marginal distributions for the seven variables in the data and simulation datasets. The three ancillary variables are shown in the upper figures and the four informative features in the lower figures. The ancillary variables are defined such that $p_{\mathrm{T}}$ is unitless, $\eta$ takes only positive values and $N$ lies in the interval [0,3].
Figure 4: Illustration of the forward pass of the normalising flow for the example of informative feature $v^\mathrm{B}_1$. The four informative features are transformed by the autoregressive structure one at a time. The ancillary variables and the IsData boolean are conditional inputs to the Masked Autoencoder for Distribution Estimation (MADE) neural network, which generates the parameters for the rational quadratic splines that transform the variables.
Figure 5: Distribution of the informative feature $v^\mathrm{B}_2$ before (left) and after the smoothing and the logarithmic transformation (right) for nominal simulation and data, normalised to unit area. The first bin in the distribution on the right includes the underflow. The last bin in both distributions includes the overflow.
...and 6 more figures
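
The quantile preservation illustrated in the bottom panel of Figure 1 can be sketched in one dimension without any flow at all. In the sketch below, which is an assumption-laden stand-in and not the method of the paper, empirical cumulative distributions play the role of the flow: a simulated value is assigned its quantile in the simulation sample and replaced by the data value at the same quantile. The function name `quantile_morph` and the toy distributions are hypothetical.

```python
import bisect
import random


def quantile_morph(x, sim_sorted, data_sorted):
    """Map x to the data value that sits at the same empirical quantile."""
    # Fraction of simulated values that are <= x.
    rank = bisect.bisect_right(sim_sorted, x)
    q = rank / len(sim_sorted)
    # Index of the same quantile in the sorted data sample.
    idx = min(int(q * len(data_sorted)), len(data_sorted) - 1)
    return data_sorted[idx]


rng = random.Random(1)
# Toy samples with different shapes (assumed for illustration only).
sim_sorted = sorted(rng.expovariate(1.0) for _ in range(10000))
data_sorted = sorted(rng.gauss(1.0, 2.0) for _ in range(10000))

# The median of the simulation should land close to the median of the data.
sim_median = sim_sorted[len(sim_sorted) // 2]
data_median = data_sorted[len(data_sorted) // 2]
morphed_median = quantile_morph(sim_median, sim_sorted, data_sorted)

# The mapping is monotonic, so the ordering of the inputs is preserved.
morphed = [quantile_morph(x, sim_sorted, data_sorted) for x in sim_sorted[::500]]
```

Such a direct quantile mapping only works one variable at a time. Going through a shared base distribution, as in Figure 1, is what allows the same idea to be applied to several correlated variables and to be conditioned on further inputs.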
