Mind the Gap: Navigating Inference with Optimal Transport Maps

Malte Algren, Tobias Golling, Francesco Armando Di Bello, Christopher Pollard

TL;DR

This work tackles misspecification in simulation-based inference by introducing an optimal transport–based horizontal calibration that acts directly in a high-dimensional latent space. Using a Transformer-based jet classifier trained on the JetClass dataset, the authors derive a conditional OT map $\hat{T}_\omega$ to align the simulated latent distribution $p_{\text{sim}}(z_{128}|\omega)$ with the data distribution $p_{\text{data}}(z_{128}|\omega)$, implemented via two ICNNs to optimize the $W_2^2$ objective and yielding $\hat{T}_\omega(z|\omega)=\nabla_z g(z|\omega)$. They show that this horizontal calibration reduces discrepancies in the latent space and in downstream discriminants, while vertical calibration would suffer sample-dilution issues in high dimensions. The results demonstrate that calibrated latent representations enable robust, unbiased jet-tagging in high-energy physics and establish a scalable framework for calibrating large, pretrained or foundation-like models in physics applications. The approach has broad applicability for correcting high-dimensional simulations across the sciences, potentially enabling reliable integration of complex models with real data.
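
To make the statement $\hat{T}_\omega(z|\omega)=\nabla_z g(z|\omega)$ concrete, here is a minimal sketch of a case where the $W_2$-optimal map is known in closed form: transport between two Gaussians. This is not the authors' ICNN-based method, and it ignores the conditioning on $\omega$. It only illustrates that the optimal map is the gradient of a convex potential, here a quadratic. The dimension, means, and covariances are hypothetical toy values, and the helper names `sqrtm_psd` and `gaussian_ot_map` are introduced for this illustration only.

```python
import numpy as np


def sqrtm_psd(matrix):
    """Symmetric square root of a positive semi-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T


def gaussian_ot_map(mean_s, cov_s, mean_t, cov_t):
    """Closed-form W2-optimal map between two Gaussians (illustrative only)."""
    s_half = sqrtm_psd(cov_s)
    s_half_inv = np.linalg.inv(s_half)
    # A is symmetric positive definite, so T(z) = mean_t + A (z - mean_s)
    # is the gradient of a convex quadratic potential g(z).
    a = s_half_inv @ sqrtm_psd(s_half @ cov_t @ s_half) @ s_half_inv

    def transport(z):
        return mean_t + (z - mean_s) @ a.T

    return transport, a


# Toy 3-dimensional stand-in for the latent space (hypothetical numbers).
rng = np.random.default_rng(0)
mean_s, mean_t = np.zeros(3), np.array([0.5, -0.2, 0.1])
cov_s = np.array([[1.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.0]])
cov_t = np.array([[1.5, 0.1, 0.0], [0.1, 0.8, 0.0], [0.0, 0.0, 1.2]])

transport, a = gaussian_ot_map(mean_s, cov_s, mean_t, cov_t)
z_sim = rng.multivariate_normal(mean_s, cov_s, size=20000)
z_cal = transport(z_sim)  # "calibrated" samples, distributed like the target
```

In this linear special case the map moves every simulated point individually, which is the sense in which the calibration is horizontal: samples are transported rather than reweighted. An ICNN plays the role of the convex potential when no closed form is available.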

Abstract

Machine learning (ML) techniques have recently enabled enormous gains in sensitivity to new phenomena across the sciences. In particle physics, much of this progress has relied on excellent simulations of a wide range of physical processes. However, due to the sophistication of modern machine learning algorithms and their reliance on high-quality training samples, discrepancies between simulation and experimental data can significantly limit their effectiveness. In this work, we present a solution to this "misspecification" problem: a model calibration approach based on optimal transport, which we apply to high-dimensional simulations for the first time. We demonstrate the performance of our approach through jet tagging, using a dataset inspired by the CMS experiment at the Large Hadron Collider. A 128-dimensional internal jet representation from a powerful general-purpose classifier is studied; after calibrating this internal "latent" representation, we find that a wide variety of quantities derived from it for downstream tasks are also properly calibrated. Using this calibrated high-dimensional representation, powerful new applications of jet flavor information can be used in LHC analyses. This is a key step toward allowing the unbiased use of "foundation models" in particle physics. More generally, this calibration framework has broad applications for correcting high-dimensional simulations across the sciences.

Paper Structure

This paper contains 26 sections, 19 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: The distributions are represented as follows: source distribution (blue) and target distribution (black). (a) shows the change in the transverse impact parameter ($d_0$), while (b) illustrates the change in number of constituents ($N_{constituent}$) between the datasets.
  • Figure 2: The two diagrams show (a) the transformer encoder and (b) the output multi-layer perceptron head (MLP Head). The transformer architecture incorporates a sequence of encoder layers, each comprising multi-head self-attention mechanisms (MHA) followed by feed-forward neural networks (MLP) with Feature-wise Linear Modulation (FiLM) for conditional integration. The MLP Head consists of a sequence of linear transformations and GELU activations with a final output size of $10$ for the class probabilities.
  • Figure 3: Corner plots of the first five principal components derived from PCA applied to the $z_{128}$ latent space. The 2D contours show 5%, 50%, 80%, 90%, and 95% percentiles. (a) Comparison between the target distribution (black) and the source distribution (blue). (b) Comparison between the target distribution (black) and the calibrated distribution (red). The calibration is derived in $z_{128}$, and subsequent layers in the MLP Head are applied to map it to the 10-dimensional output space.
  • Figure 4: (a) Receiver operating characteristic (ROC) curves for discriminators $\mathcal{D}_L$ trained to differentiate between source and target distributions across three representational spaces: the 128-dimensional latent space ($z_{128}$), the final 128-dimensional layer in the MLP Head ($z'_{128}$), and the 10-dimensional output space ($z'_{10}$). The discriminators are subsequently evaluated on the calibrated versus target distributions, where an area under the curve (AUC) close to 0.5 indicates successful mitigation of the distributional discrepancies that the discriminator previously exploited. (b) Marginal distributions of the discriminator $\mathcal{D}_L$ scores in $z_{128}$ for the source (blue) and target (black) distributions.
  • Figure 5: Comparisons between the marginal distributions of the physics-motivated one-dimensional discriminants. The distributions are represented as follows: target distribution (black), calibrated distribution (red), and source distribution (blue). The calibration is derived on the $z_{128}$ latent space, and then the subsequent MLP Head layers are applied to map it to the output space. Afterwards the output space is projected onto physics-motivated one-dimensional discriminants.
  • ...and 8 more figures
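
The closure test described in the Figure 4 caption (train a discriminator on source versus target, then evaluate it on calibrated versus target and look for an AUC near 0.5) can be sketched on toy data. Everything below is an assumption made for illustration: the 5-dimensional Gaussian samples, the linear mean-difference discriminator standing in for $\mathcal{D}_L$, and the simple mean shift standing in for the OT calibration. None of it reproduces the paper's networks or numbers.

```python
import numpy as np


def rank_auc(scores_neg, scores_pos):
    """AUC from the Mann-Whitney rank statistic (assumes no tied scores)."""
    scores = np.concatenate([scores_neg, scores_pos])
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_neg, n_pos = len(scores_neg), len(scores_pos)
    return (ranks[n_neg:].sum() - n_pos * (n_pos + 1) / 2) / (n_neg * n_pos)


# Toy "source" (simulation) and "target" (data) samples; hypothetical values.
rng = np.random.default_rng(1)
dim, n = 5, 4000
shift = np.full(dim, 0.5)
source = rng.normal(size=(2 * n, dim))
target = rng.normal(size=(2 * n, dim)) + shift
src_train, src_test = source[:n], source[n:]
tgt_train, tgt_test = target[:n], target[n:]

# Stand-in discriminator: linear score along the difference of training means.
w = tgt_train.mean(axis=0) - src_train.mean(axis=0)

# Stand-in calibration: shift the source by the estimated mean difference.
cal_test = src_test + w

# The discriminator is fit once on source vs. target and then reused.
auc_before = rank_auc(src_test @ w, tgt_test @ w)  # source vs. target
auc_after = rank_auc(cal_test @ w, tgt_test @ w)   # calibrated vs. target
print(f"AUC before calibration: {auc_before:.3f}")
print(f"AUC after calibration:  {auc_after:.3f}")
```

Reusing the discriminator trained on the uncalibrated pair is the important design choice: it checks whether the specific differences it learned to exploit are gone. It does not rule out residual differences that this discriminator never picked up.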