Neural Fake Factor Estimation Using Data-Based Inference
Jan Gavranovič, Lara Čalić, Jernej Debevc, Else Lytken, Borut Paul Kerševan
TL;DR
This work tackles the challenge of estimating fake lepton backgrounds in high-energy physics by addressing the limitations of traditional histogram-based Fake Factor methods. It introduces a data-based inference approach that uses neural density-ratio estimation to compute a continuous, per-event fake factor $F(\mathbf{x})$ in high-dimensional feature spaces, circumventing binning artifacts. The method is validated on an ATLAS Open Data $W \to e\nu$ analysis, showing smoother, more stable extrapolation from control to signal regions and improved modeling in high-dimensional regimes. Overall, the approach enhances the fidelity and flexibility of data-driven background estimation, with potential for application to multi-lepton final states and more complex analyses, while highlighting directions for uncertainty quantification.
Abstract
In a high-energy physics data analysis, the term "fake backgrounds" refers to events that would formally not satisfy the (signal) process selection criteria, but are accepted nonetheless due to mis-reconstructed particles. This can occur, e.g., when leptons from secondary decays are incorrectly identified as originating from the hard-scatter interaction point (known as non-prompt leptons), or when other physics objects, such as hadronic jets, are mistakenly reconstructed as leptons (resulting in mis-identified leptons). These fake leptons are usually estimated using data-driven techniques, one of the most common being the Fake Factor method. This method predicts the fake lepton contribution by reweighting data events with a scale factor function (i.e., the fake factor). Traditionally, fake factors have been estimated by histogramming and computing the ratio of two data distributions, typically as functions of a few relevant physics variables such as the transverse momentum $p_\text{T}$ and pseudorapidity $\eta$. In this work, we introduce a novel approach to fake factor calculation, based on density ratio estimation using neural networks trained directly on data in a higher-dimensional feature space. We show that our method enables the computation of a continuous, unbinned fake factor on a per-event basis, offering a more flexible, precise, and higher-dimensional alternative to the conventional method, making it applicable to a wide range of analyses. A simple LHC open data analysis we implemented confirms the feasibility of the method and demonstrates that the ML-based fake factor provides smoother, more stable estimates across the phase space than traditional methods, reducing binning artifacts and improving extrapolation to signal regions.
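
To make the idea of an unbinned fake factor concrete, the following is a minimal sketch of classifier-based density-ratio estimation on toy data. It is not the implementation used in this work. A classifier $s(x)$ trained with binary cross-entropy to separate numerator events (label 1) from denominator events (label 0) yields the ratio estimate $F(x) = s(x) / (1 - s(x))$, which includes the ratio of event counts when the training sample is unweighted. Everything in the sketch is an assumption made for illustration: the one-dimensional Gaussian toy samples, the event counts, the hypothetical names `train` and `fake_factor`, and the use of a two-parameter logistic model in place of a neural network so that the example runs with the standard library only.

```python
import math
import random

random.seed(0)

# Toy fake-enriched control region (all numbers are illustrative assumptions).
N_DEN, N_NUM = 4000, 2000
den = [random.gauss(0.0, 1.0) for _ in range(N_DEN)]  # "denominator" events, label 0
num = [random.gauss(0.5, 1.0) for _ in range(N_NUM)]  # "numerator" events, label 1

xs = den + num
ys = [0.0] * N_DEN + [1.0] * N_NUM


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(xs, ys, lr=0.5, steps=300):
    """Fit s(x) = sigmoid(w*x + b) by full-batch gradient descent on the
    mean binary cross-entropy. A logistic model stands in for a neural network."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of the loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


w, b = train(xs, ys)


def fake_factor(x):
    """Per-event fake factor F(x) = s(x) / (1 - s(x)) = exp(logit)."""
    return math.exp(w * x + b)


def true_fake_factor(x):
    """Analytic ratio of the toy densities times the ratio of event counts."""
    return (N_NUM / N_DEN) * math.exp(0.5 * x - 0.125)


for x in (-1.0, 0.0, 1.0):
    print(f"x = {x:+.1f}  estimated F = {fake_factor(x):.3f}  true F = {true_fake_factor(x):.3f}")

# Closure check: reweighting denominator events by F should roughly
# reproduce the numerator yield.
predicted_yield = sum(fake_factor(x) for x in den)
print(f"predicted numerator yield = {predicted_yield:.0f}  (actual = {N_NUM})")
```

Because the toy log-ratio is linear in $x$, the logistic model is sufficient here. In a higher-dimensional feature space the same construction would use a more expressive classifier, with the per-event weight still read off as $s/(1-s)$.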
