Data-Driven High-Dimensional Statistical Inference with Generative Models

Oz Amram, Manuel Szewc

TL;DR

HI-SIGMA introduces a data-driven, high-dimensional inference framework for resonant analyses at the LHC by learning multi-dimensional signal and background densities with generative models and performing unbinned likelihood fits. It factors densities around a resonance as $P_k(\boldsymbol{x})=P_k(\boldsymbol{x}'|m)P_k(m)$, uses an extended profile likelihood, and interpolates backgrounds from sidebands into the signal region, enabling robust uncertainty quantification. The method demonstrates improved sensitivity over traditional cut-based and low-bin classifier approaches in a di-Higgs $bb\gamma\gamma$ proxy, while maintaining interpretability and a principled treatment of systematic uncertainties via bootstrapping and shape variations. This work highlights the practical potential of data-driven, high-dimensional density estimation for complex final states and multi-parameter inference, with implications for future Higgs measurements and EFT operator constraints.
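
For orientation, the factorized densities can be placed inside a standard extended unbinned likelihood for a signal-plus-background mixture. The form below is a generic textbook sketch rather than the paper's exact expression: the signal strength $\mu$, the expected yields $S$ and $B$, the nuisance parameters $\boldsymbol{\theta}$, and the observed event count $N$ are notation assumed here for illustration.

$$
\mathcal{L}(\mu,\boldsymbol{\theta}) = \frac{e^{-(\mu S + B)}\,(\mu S + B)^{N}}{N!}\,\prod_{i=1}^{N}\frac{\mu S\,P_S(\boldsymbol{x}_i'|m_i)\,P_S(m_i) + B\,P_B(\boldsymbol{x}_i'|m_i)\,P_B(m_i)}{\mu S + B}
$$

In a profile-likelihood treatment, $\boldsymbol{\theta}$ (which may enter the yields and the densities) would be maximized over at each value of $\mu$.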

Abstract

Crucial to many measurements at the LHC is the use of correlated multi-dimensional information to distinguish rare processes from large backgrounds, which is complicated by the poor modeling of many of the crucial backgrounds in Monte Carlo simulations. In this work, we introduce HI-SIGMA, a method to perform unbinned high-dimensional statistical inference with data-driven background distributions. In contradistinction to many applications of Simulation-Based Inference in High Energy Physics, HI-SIGMA relies on generative ML models, rather than classifiers, to learn the signal and background distributions in the high-dimensional space. These ML models allow for interpretable inference while also incorporating model errors and other sources of systematic uncertainties. We showcase this methodology on a simplified version of a di-Higgs measurement in the $bb\gamma\gamma$ final state, where the di-photon resonance allows for background interpolation from sidebands into the signal region. We demonstrate that HI-SIGMA provides improved sensitivity as compared to standard classifier-based methods, and that systematic uncertainties can be straightforwardly incorporated by extending methods which have been used for histogram-based analyses.
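
To make the idea of an unbinned fit with explicit signal and background densities concrete, here is a minimal toy sketch in pure Python. It is not the HI-SIGMA implementation: analytic densities stand in for the learned generative models, the mass window, resonance position, slopes, and feature means are all hypothetical, and the yields are fitted with a simple fixed-point (EM) iteration rather than whatever minimizer the authors use.

```python
import math
import random

# All numbers below are hypothetical toy choices, not values from the paper.
M_LO, M_HI = 100.0, 180.0   # mass window (GeV)
M_PEAK, M_RES = 125.0, 1.5  # resonance position and resolution (GeV)
SLOPE = 0.02                # exponential slope of the background mass spectrum
BKG_NORM = 1.0 - math.exp(-SLOPE * (M_HI - M_LO))


def gauss(x, mean, sigma):
    """Normal probability density."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def bkg_feature_mean(m):
    # The feature mean drifts with mass, so P_B(x'|m) genuinely depends on m.
    return -1.0 + 0.01 * (m - M_PEAK)


def p_sig(m, x):
    # P_S(x'|m) P_S(m): feature density (mass-independent here) times a mass peak.
    return gauss(x, 1.0, 1.0) * gauss(m, M_PEAK, M_RES)


def p_bkg(m, x):
    # P_B(x'|m) P_B(m): mass-dependent feature density times a truncated exponential.
    p_m = SLOPE * math.exp(-SLOPE * (m - M_LO)) / BKG_NORM
    return gauss(x, bkg_feature_mean(m), 1.0) * p_m


def sample_events(n_sig, n_bkg, rng):
    """Draw (mass, feature) pairs from the toy signal and background densities."""
    events = []
    for _ in range(n_sig):
        events.append((rng.gauss(M_PEAK, M_RES), rng.gauss(1.0, 1.0)))
    for _ in range(n_bkg):
        # Inverse-CDF sampling of the truncated exponential mass spectrum.
        m = M_LO - math.log(1.0 - rng.random() * BKG_NORM) / SLOPE
        events.append((m, rng.gauss(bkg_feature_mean(m), 1.0)))
    return events


def extended_nll(s, b, dens):
    # -log L up to a constant: (s + b) - sum_i log(s p_S(x_i) + b p_B(x_i)).
    return (s + b) - sum(math.log(s * ps + b * pb) for ps, pb in dens)


def fit_yields(events, n_iter=500):
    """Maximize the extended likelihood in (s, b) by fixed-point (EM) iteration."""
    dens = [(p_sig(m, x), p_bkg(m, x)) for m, x in events]
    s = b = 0.5 * len(events)
    for _ in range(n_iter):
        # Per-event probability of being signal under the current yields.
        w = [s * ps / (s * ps + b * pb) for ps, pb in dens]
        s = sum(w)
        b = len(events) - s
    return s, b, extended_nll(s, b, dens)


rng = random.Random(0)
events = sample_events(n_sig=200, n_bkg=2000, rng=rng)
s_hat, b_hat, nll_min = fit_yields(events)
print(f"fitted signal yield: {s_hat:.1f}, fitted background yield: {b_hat:.1f}")
```

In a realistic analysis the two density functions would be replaced by trained generative models, with the background model fitted on sideband events and evaluated in the signal region, and the fit would additionally profile nuisance parameters describing model and systematic uncertainties.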

Paper Structure

This paper contains 25 sections, 15 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Distributions of the signal (red) and of the background in the sideband region (blue) and signal region (green). In the top plot, the dashed red lines denote the boundaries of the signal region defined by $m_{\gamma\gamma}$. The last bin shows overflow entries. The distributions of background events in the signal and sideband regions are similar but show visible differences, which are learned by the interpolation aspect of the HI-SIGMA approach.
  • Figure 2: Distributions of the $\Delta R_{bb}$ feature before (left) and after (right) the smearing procedure. The smearing softens the sharp edge at 0.4. It incurs some information loss but does not significantly impact signal versus background discrimination.
  • Figure 3: Events generated from a background model (single bootstrap) as compared to true events in the signal region. To spot potential minor mismodelings, the comparison is performed using samples an order of magnitude larger than those used for the inference example. Good agreement is observed overall, with only minor discrepancies visible in the tails.
  • Figure 4: Distributions of the $p_{T}^{bb}$ feature for the $-\sigma$, nominal, and $+\sigma$ variations of the shape systematic. The other features are assumed to be unaffected.
  • Figure 5: Illustrations of the original (blue) and distorted (green) feature distributions of the background sample. The distorted sample is used to train classifiers, approximating data-MC modeling imperfections. See text for details.
  • ...and 8 more figures