Improvement and generalization of ABCD method with Bayesian inference

Ezequiel Alvarez, Leandro Da Rold, Manuel Szewc, Alejandro Szynkman, Santiago A. Tanco, Tatiana Tarutina

TL;DR

The paper targets limitations of the ABCD background-estimation method, notably its dependence on hard region definitions, binary observables, and limited use of background information. It introduces a Bayesian mixture-model framework with $K$ classes and $D$ observables, enabling soft assignments of events to classes and posterior inference of the class fractions, thereby generalizing ABCD. Through a di-Higgs-inspired toy problem, the authors show that the Bayesian approach can achieve more accurate and robust estimates of the signal fraction than ABCD, even at true signal fractions as low as $0.5\%$, and remains well-behaved in the no-signal case. The work presents a principled path toward more information-rich, multi-observable, data-driven background estimates for LHC analyses, while outlining clear steps to extend toward realistic scenarios with additional backgrounds and systematics.

Abstract

Finding New Physics or refining our knowledge of the Standard Model at the LHC is an enterprise that involves many factors. We focus on taking advantage of available information and put our effort into rethinking the usual data-driven ABCD method, to improve it and to generalize it using Bayesian Machine Learning tools. We propose that a dataset consisting of a signal and many backgrounds is well described by a mixture model. The signal, the backgrounds and their relative fractions in the sample can be well extracted by exploiting prior knowledge and the dependence between the different observables at the event-by-event level with Bayesian tools. We show how, in contrast to the ABCD method, one can take advantage of understanding some properties of the different backgrounds and of having more than two independent observables to measure in each event. In addition, instead of regions defined through hard cuts, the Bayesian framework uses the information in continuous distributions to obtain soft assignments of the events, which are statistically more robust. To compare both methods we use a toy problem inspired by $pp\to hh\to b\bar b b \bar b$, selecting a reduced and simplified set of processes and analysing the flavor of the four jets and the invariant masses of the jet pairs, modeled with simplified distributions. Taking advantage of all this information, and starting from a combination of biased and agnostic priors, leads to a very good posterior once we use the Bayesian framework to exploit the data and the mutual information of the observables at the event-by-event level. We show how, in this simplified model, the Bayesian framework outperforms the ABCD method in sensitivity when extracting the signal fraction in scenarios with $1\%$ and $0.5\%$ true signal fractions in the dataset. We also show that the method is robust against the absence of signal.
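
To make the idea of soft assignments concrete, the following is a minimal sketch, not the authors' implementation. It assumes a mixture with fixed, known class fractions and per-class densities that factorize over the observables, and it computes the probability that a single event belongs to each class. The function names, the Gaussian shapes, and all numerical values are hypothetical choices made for illustration; the paper instead infers the fractions and shape parameters from data, starting from priors.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Normal density, used here as a stand-in for a simplified mass distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def soft_assignments(event, fractions, class_pdfs):
    """Return the per-class membership probabilities of one event.

    event      -- sequence of D observable values
    fractions  -- sequence of K class fractions (summing to 1)
    class_pdfs -- K sequences of D one-dimensional densities; within a
                  class the observables are assumed independent
    """
    weights = []
    for pi_k, pdfs in zip(fractions, class_pdfs):
        likelihood = pi_k
        for x_d, pdf in zip(event, pdfs):
            likelihood *= pdf(x_d)
        weights.append(likelihood)
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical two-class example: a narrow "signal" peak and a broad "background",
# each described by two invariant masses (illustrative numbers only).
def signal_mass(m):
    return gaussian_pdf(m, 125.0, 10.0)


def background_mass(m):
    return gaussian_pdf(m, 100.0, 50.0)


fractions = [0.01, 0.99]  # assumed signal and background fractions
class_pdfs = [[signal_mass, signal_mass], [background_mass, background_mass]]

on_peak = soft_assignments((125.0, 125.0), fractions, class_pdfs)
off_peak = soft_assignments((60.0, 180.0), fractions, class_pdfs)
print(on_peak, off_peak)
```

An event sitting on the signal peak receives a signal probability well above the assumed $1\%$ fraction, while an event far from the peak receives one close to zero. No event is assigned to a region by a hard cut; each contributes to every class with its own weight.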

Paper Structure (11 sections, 19 equations, 9 figures)

This paper contains 11 sections, 19 equations, 9 figures.

Figures (9)

  • Figure 1: ABCD method: observable ${\cal O}_1$ takes values that fall in either AB or CD, whereas observable ${\cal O}_2$ takes values that fall in either AC or BD. Assuming that the signal is restricted to A, and that the ${\cal O}_{1,2}$ distributions for the background are independent, one has $N_A(\hbox{background}) = N_B \times N_C / N_D$; see text for details.
  • Figure 2: General Graphical Model for a mixture model. $k$ runs over the $K$ classes, $n$ runs over the $N$ events, and $d$ over the $D$ independent observables. Random variables are represented by circles while arrows represent conditional dependence. White circles represent latent variables which are unobserved while blue circles represent measured variables whose observation conditions the posterior distribution over the parameters. See, for instance, Chapter 8 in Ref. bishop for details about representing a probability density function using a Graphical Model.
  • Figure 3: Graphical Model for the probabilistic model considered for our toy problem. Observation of the six-dimensional data consisting of four $b$-tag scores ${\cal S}_{1..4}$ and the two invariant masses $m_{1,2}$, conditions the posterior distribution of the parameters of interest $\theta=\{ \pi_{k},\alpha_{j},\beta_{j},\lambda,\mu,\sigma\}$. Here $N$ runs over the events and $J$ runs over the two individual types for jet classification, c- and b-jets. Prior hyperparameters not shown here are specified in the text.
  • Figure 4: Top: Data distribution of $b$-tagging score values $\mathcal{S}$ for each of the jet types. True (MAP) distributions are shown in dashed (solid) lines for each jet type, while several distributions sampled from the prior for each individual type are shown in thin solid lines. (The dashed and solid lines largely overlap.) The MAP distributions are inferred from a dataset with $\pi_{s}=1\%$. The dotted vertical lines correspond to the WP thresholds we use in this work. Notice that the data are four-dimensional in the $b$-tagging scores, but here we project them onto one dimension to showcase the inference on the individual jet types. Bottom: Difference between MAP and true distributions for each of the jet types.
  • Figure 5: Top: Data distribution of mass values $m$ for each of the individual mass types. True (MAP) distributions are shown in dashed (solid) lines for each mass distribution type, while several distributions sampled from the prior for each type are shown in thin solid lines. The MAP distributions are inferred from a dataset with $\pi_{s}=1\%$. R and NR stand for resonant and non-resonant, respectively. Bottom: Difference between MAP and true distributions for each of the mass types.
  • ...and 4 more figures
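
The ABCD relation quoted in the caption of Figure 1 can be illustrated with a short sketch. It assumes, as the caption does, that the signal is confined to region A and that the two observables are independent for the background. The function names and the event counts are hypothetical, and no statistical uncertainty is propagated.

```python
def abcd_background_in_a(n_b, n_c, n_d):
    """Estimate the background yield in region A as N_B * N_C / N_D."""
    if n_d <= 0:
        raise ValueError("region D must contain events for the ABCD estimate")
    return n_b * n_c / n_d


def abcd_signal_fraction(n_a, n_b, n_c, n_d):
    """Estimate the signal fraction of the full sample.

    Assumes all signal lies in region A, so the excess of N_A over the
    background estimate is attributed entirely to signal.
    """
    n_total = n_a + n_b + n_c + n_d
    excess = n_a - abcd_background_in_a(n_b, n_c, n_d)
    return excess / n_total


# Hypothetical counts in the four regions (illustrative numbers only).
background_a = abcd_background_in_a(200, 300, 600)
signal_fraction = abcd_signal_fraction(110, 200, 300, 600)
print(background_a, signal_fraction)
```

Because the estimate uses only four counts, any information about the shapes of the distributions inside each region is discarded, which is one of the limitations the Bayesian mixture model is meant to address.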