Neural Fake Factor Estimation Using Data-Based Inference

Jan Gavranovič, Lara Čalić, Jernej Debevc, Else Lytken, Borut Paul Kerševan

TL;DR

This work tackles the challenge of estimating fake lepton backgrounds in high-energy physics by addressing the limitations of traditional histogram-based Fake Factor methods. It introduces a data-based inference approach that uses neural density-ratio estimation to compute a continuous, per-event fake factor $F(\mathbf{x})$ in high-dimensional feature spaces, circumventing binning artifacts. The method is validated on an ATLAS Open Data $W\to e\nu$ analysis, showing smoother, more stable extrapolation from control to signal regions and improved modeling in high-dimensional regimes. Overall, the approach enhances the fidelity and flexibility of data-driven background estimation, with potential for application to multi-lepton final states and more complex analyses, while highlighting directions for uncertainty quantification.
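The idea behind classifier-based density-ratio estimation can be illustrated with a toy example. The sketch below is not the paper's model: it swaps the neural network for a one-parameter logistic regression trained with plain gradient descent, and the two Gaussian samples, the learning rate, and the helper `density_ratio` are all assumptions made for illustration. It relies only on the standard result that, for balanced samples and a cross-entropy loss, the exponential of the classifier logit approximates the ratio of the two densities.

```python
import numpy as np

# Toy samples (assumed for illustration): label 0 ~ N(0, 1), label 1 ~ N(1, 1).
# The exact log density ratio log(p1/p0) is then x - 0.5.
rng = np.random.default_rng(0)
n = 20000
x0 = rng.normal(0.0, 1.0, n)
x1 = rng.normal(1.0, 1.0, n)
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Logistic regression q(x) = w * x + b, fitted by gradient descent on the
# binary cross-entropy loss (a stand-in for a neural network classifier).
w, b = 0.0, 0.0
learning_rate = 0.5
for _ in range(2000):
    q = w * x + b                    # logit
    s = 1.0 / (1.0 + np.exp(-q))     # sigmoid output
    w -= learning_rate * np.mean((s - y) * x)
    b -= learning_rate * np.mean(s - y)


def density_ratio(x_new):
    """Per-event estimate of p1(x) / p0(x) as exp(logit)."""
    return np.exp(w * np.asarray(x_new, dtype=float) + b)


print(f"fitted logit: q(x) = {w:.3f} * x + {b:.3f}")
print("ratio at x = 0.5:", density_ratio(0.5))
```

Because the estimate is a smooth function of the input, it can be evaluated for every event individually, with no bin boundaries involved.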

Abstract

In a high-energy physics data analysis, the term "fake" backgrounds refers to events that would formally not satisfy the (signal) process selection criteria, but are accepted nonetheless due to mis-reconstructed particles. This can occur, e.g., when leptons from secondary decays are incorrectly identified as originating from the hard-scatter interaction point (known as non-prompt leptons), or when other physics objects, such as hadronic jets, are mistakenly reconstructed as leptons (resulting in mis-identified leptons). These fake leptons are usually estimated using data-driven techniques, one of the most common being the Fake Factor method. This method predicts the fake lepton contribution by reweighting data events with a scale factor function (the fake factor). Traditionally, fake factors have been estimated by histogramming and computing the ratio of two data distributions, typically as functions of a few relevant physics variables such as the transverse momentum $p_\text{T}$ and pseudorapidity $\eta$. In this work, we introduce a novel approach to fake factor calculation, based on density ratio estimation using neural networks trained directly on data in a higher-dimensional feature space. We show that our method enables the computation of a continuous, unbinned fake factor on a per-event basis, offering a more flexible, precise, and higher-dimensional alternative to the conventional method, making it applicable to a wide range of analyses. A simple LHC open data analysis we implemented confirms the feasibility of the method and demonstrates that the ML-based fake factor provides smoother, more stable estimates across the phase space than traditional methods, reducing binning artifacts and improving extrapolation to signal regions.
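For comparison, the traditional histogram-based fake factor mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis code: the function names `binned_fake_factor` and `lookup_fake_factor`, the bin edges, and the toy lepton values are all hypothetical, and the prompt-lepton subtraction that a real analysis performs before taking the ratio is omitted.

```python
import numpy as np


def binned_fake_factor(pt_tight, eta_tight, pt_loose, eta_loose, pt_edges, eta_edges):
    """Ratio of tight to loose lepton counts in (pT, |eta|) bins."""
    bins = [pt_edges, eta_edges]
    n_tight, _, _ = np.histogram2d(pt_tight, np.abs(eta_tight), bins=bins)
    n_loose, _, _ = np.histogram2d(pt_loose, np.abs(eta_loose), bins=bins)
    # Bins with no loose leptons get a fake factor of 0 instead of NaN or inf.
    return np.divide(n_tight, n_loose, out=np.zeros_like(n_tight), where=n_loose > 0)


def lookup_fake_factor(ff, pt, eta, pt_edges, eta_edges):
    """Per-lepton lookup of the binned fake factor (values clipped to edge bins)."""
    i = np.clip(np.digitize(pt, pt_edges) - 1, 0, len(pt_edges) - 2)
    j = np.clip(np.digitize(np.abs(eta), eta_edges) - 1, 0, len(eta_edges) - 2)
    return ff[i, j]


# Hypothetical binning and toy leptons (pT in GeV), chosen only for illustration.
pt_edges = np.array([20.0, 40.0, 60.0])
eta_edges = np.array([0.0, 1.5, 2.5])
pt_tight = np.array([25.0, 45.0])
eta_tight = np.array([0.5, -2.0])
pt_loose = np.array([25.0, 30.0, 45.0, 50.0, 55.0, 35.0])
eta_loose = np.array([0.2, -0.7, 2.0, -1.8, 2.2, 2.0])

ff = binned_fake_factor(pt_tight, eta_tight, pt_loose, eta_loose, pt_edges, eta_edges)
weights = lookup_fake_factor(ff, np.array([30.0, 50.0]), np.array([0.1, -2.0]),
                             pt_edges, eta_edges)
print(ff)
print(weights)
```

Every lepton falling in the same bin receives the same weight, and the weight jumps at bin edges. These are the binning artifacts that a continuous, per-event fake factor is meant to avoid.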

Paper Structure

This paper contains 17 sections, 20 equations, 15 figures.

Figures (15)

  • Figure 1: Diagram of the Fake Factor method for events with two leptons, where the method is applied to each lepton individually. For the LT and TL combinations, only one fake factor is assigned. In the signal region, where both leptons are tight (TT), the event weight is 1. When both leptons are loose (LL), two fake factors are applied, one per lepton.
  • Figure 2: Visualization of the ABCD method, which is equivalent to the Fake Factor method when using only event counts. To estimate the number of fake leptons in region A, the ratio of the number of fake leptons in kinematically orthogonal tight and loose regions B and D is evaluated first, giving the value of $F$. Since this ratio is assumed to be equal in both SR and CR, the number of fake leptons in region A can be obtained by applying $F$ as a transfer factor to the number of fake leptons in region C.
  • Figure 3: Flow diagram of the ML-based method to obtain the fake factor $F$ as a density ratio $r_F(\mathbf{x})$. Firstly, two independent classifiers are trained in the tight and loose regions to model the ratios $r^\text{T,L}$ between data and MC. These can then be used as correction factors to obtain prompt-subtracted densities by reweighting either data or MC events, giving the two branches of the diagram. Lastly, a third classifier is trained on reweighted events to separate loose and tight prompt-subtracted distributions, which gives the final density ratio $r_F(\mathbf{x})$.
  • Figure 4: Schematic illustration of the classifier model architecture used in this work. Our classifiers use a pre-activation residual network (ResNet) architecture. The numerical features $\mathbf{x}_\text{num.}$, concatenated with the categorical features $\mathbf{x}_\text{cat.}$, are embedded through an embedding layer (EMB) and then passed through a projection layer (if needed) before being fed into the ResNet. The ResNet is schematically shown as a stack of batch normalization (BN), weight multiplication (W), and activation function (ACT) layers, with residual connections between them. The output layer uses either a soft absolute or linear activation function to produce the final output (logit) value $q$, as described in the text.
  • Figure 5: The soft absolute activation function (red) constrains the (logit) outputs $q$ of both subtraction classifier networks to be non-negative, which is required to keep the data correction weights positive. The exponential of the output, $r=\exp(q)$, is used to obtain the density ratio estimate (blue), which satisfies $r>1$ when using the proposed activation function. The data reweighting function (orange) is given in Eq. \ref{eq:subtraction}. In our implementation, we reweight data, using label 0 for MC and label 1 for data. Using the soft absolute activation ensures that the reweighting function remains non-negative and bounded within $[0, 1]$, as required.
  • ...and 10 more figures
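
The count-based ABCD relation described in the Figure 2 caption amounts to simple arithmetic, sketched below. The function name `abcd_estimate` and the event counts are hypothetical, chosen only to make the numbers easy to follow. The region labels follow the caption: B and D are the tight and loose regions used to measure $F$, C is the loose region to which $F$ is applied, and A is the region where the fake yield is predicted.

```python
def abcd_estimate(n_b, n_c, n_d):
    """Return (F, N_A) with F = N_B / N_D and N_A = F * N_C."""
    if n_d <= 0:
        raise ValueError("region D must contain events to define the fake factor")
    fake_factor = n_b / n_d
    return fake_factor, fake_factor * n_c


# Hypothetical fake-lepton counts in regions B, C, and D.
fake_factor, n_a = abcd_estimate(n_b=50, n_c=400, n_d=200)
print(fake_factor)  # 0.25
print(n_a)          # 100.0
```

The estimate is only as good as the assumption stated in the caption, namely that the tight-to-loose ratio is the same in the signal and control regions.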