Training toward significance with the decorrelated event classifier transformer neural network

Jaebak Kim

TL;DR

The paper aims to improve resonance-search sensitivity by binning events with a transformer-based event classifier while controlling the correlation between the classifier output and the reconstructed mass. It introduces an event classifier transformer architecture and three targeted training techniques: extreme loss (mass decorrelation with enhanced significance), DisCo regularization, and data scope training, plus significance-based epoch selection. In a simplified $H\rightarrow Z(\ell\ell)\gamma$ study, the transformer trained with these techniques yields the highest expected significance and the lowest mass correlation, outperforming boosted decision trees and feed-forward networks. The work demonstrates a practical approach to decorrelated, significance-optimized bump-hunt analyses, with methodological and implementation details given for reproducibility.
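To make the idea of DisCo regularization concrete, the following is a minimal NumPy sketch of the sample distance correlation between a classifier output and the reconstructed mass. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `strength` parameter, and the choice to add the unsquared distance correlation as a penalty are all hypothetical.

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation between two 1D arrays.

    Values lie in [0, 1]; the population value is 0 only for independent variables.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute-distance matrices for each variable.
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double-centre: subtract row and column means, add back the grand mean.
    a_c = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    b_c = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2_xy = (a_c * b_c).mean()
    dcov2_xx = (a_c * a_c).mean()
    dcov2_yy = (b_c * b_c).mean()
    denom = np.sqrt(dcov2_xx * dcov2_yy)
    if denom == 0.0:
        return 0.0  # one of the inputs is constant
    return float(np.sqrt(max(dcov2_xy, 0.0) / denom))


def penalized_loss(classification_loss, output, mass, strength=1.0):
    """Hypothetical helper: classification loss plus a mass-correlation penalty.

    `strength` is an assumed hyperparameter; the paper's actual loss may differ.
    """
    return classification_loss + strength * distance_correlation(output, mass)
```

Unlike the Pearson coefficient, this measure also responds to nonlinear dependence, which is why it suits cases where the classifier output sculpts the mass shape. In actual training the same quantity would have to be written in an automatic-differentiation framework so that the penalty contributes gradients.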

Abstract

Experimental particle physics uses machine learning for many tasks, where one application is to classify signal and background events. This classification can be used to bin an analysis region to enhance the expected significance for a mass resonance search. In natural language processing, one of the leading neural network architectures is the transformer. In this work, an event classifier transformer is proposed to bin an analysis region, in which the network is trained with special techniques. The techniques developed here can enhance the significance and reduce the correlation between the network's output and the reconstructed mass. It is found that this trained network can perform better than boosted decision trees and feed-forward networks.
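The binning idea in the abstract can be sketched as follows: split events by classifier output into bins that each hold an equal share of the signal, compute a per-bin expected significance, and combine the bins in quadrature. This is a rough sketch under assumptions. The helper names are hypothetical, and the Asimov counting formula and the quadrature sum are a common choice assumed here, not necessarily the exact metric used in the paper.

```python
import numpy as np


def equal_signal_edges(signal_scores, n_bins):
    """Interior classifier-output edges giving each bin an equal share of signal events."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(signal_scores, quantiles)


def bin_yields(scores, interior_edges, weights=None):
    """Weighted event yield in each classifier-output bin."""
    index = np.searchsorted(interior_edges, scores, side="right")
    return np.bincount(index, weights=weights, minlength=len(interior_edges) + 1)


def asimov_significance(s, b):
    """Expected significance of a counting experiment with s signal and b background events."""
    s = np.asarray(s, dtype=float)
    b = np.asarray(b, dtype=float)
    if np.any(b <= 0.0):
        raise ValueError("each bin needs a positive background expectation")
    return np.sqrt(2.0 * ((s + b) * np.log1p(s / b) - s))


def combined_significance(sig_scores, bkg_scores, n_bins, sig_weights=None, bkg_weights=None):
    """Quadrature sum of the per-bin significances (assumes statistically independent bins)."""
    edges = equal_signal_edges(sig_scores, n_bins)
    s = bin_yields(sig_scores, edges, sig_weights)
    b = bin_yields(bkg_scores, edges, bkg_weights)
    return float(np.sqrt(np.sum(asimov_significance(s, b) ** 2)))
```

With a classifier that separates signal from background, the combined value over several bins exceeds the single-bin value because the high-output bins carry a better signal-to-background ratio. In a real resonance search the yields would be counted inside a mass window or extracted from a fit to the mass spectrum, which is where correlation between the output and the mass becomes a concern.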

Paper Structure (14 sections, 8 equations, 7 figures, 1 table)

Introduction
Event classifier transformer neural network
Training techniques for enhancing significance
Specialized loss function with mass decorrelation
Data scope training
Significance-based model selection
Example analysis
Dataset
Input features for machine-learning techniques
Machine-learning techniques
Machine-learning technique evaluation metrics
Experiment and results
Related work
Summary and conclusions

Figures (7)

Figure 1: Left: mass of reconstructed Higgs boson candidates from the $H\rightarrow Z\left(\ell^{+}\ell^{-}\right)\gamma$ decay, where a bump can be seen due to the presence of the Higgs boson particle. The Higgs boson cross section was scaled up by 100 to make the bump visible. Right: mass of reconstructed Higgs boson candidates from the $H\rightarrow Z\left(\ell^{+}\ell^{-}\right)\gamma$ decay with the nominal Higgs boson cross section, where the bump cannot be seen due to the background.
Figure 2: Architecture of the event classifier transformer. FFN refers to a feed-forward neural network. Normalize refers to layer normalization. Add refers to the implementation of the residual connection. Concat refers to a layer concatenating tokens. Linear refers to a linear layer.
Figure 3: BCE loss vs extreme loss, when the label is $y=0$. Extreme loss penalizes the neural network more than BCE loss for network predictions that are close to 1.
Figure 4: Top: reconstructed $m_{\ell\ell\gamma}$ background distributions, where each histogram is a bin in the XGBoost output distribution with an equal number of signal events. Lower signal percentile (sig. p) values correspond to higher output values. $p_{T}^{\gamma}/m_{\ell\ell\gamma}$, $p_{T}^{\text{leading }\ell}$, $p_{T}^{\text{subleading }\ell}$ are not used in the training. Bottom: reconstructed $m_{\ell\ell\gamma}$ background distributions, when training includes $p_{T}^{\gamma}/m_{\ell\ell\gamma}$, $p_{T}^{\text{leading }\ell}$, and $p_{T}^{\text{subleading }\ell}$ inputs for XGBoost. For certain bins, the background peaks close to the Higgs boson mass of 125 GeV, which introduces difficulties in estimating the number of signal events. Correlation represents the magnitude of difference in the shapes between the machine-learning bins.
Figure 5: $m_{\ell\ell\gamma}$ distribution of the background, where each histogram is a bin in the machine-learning technique output distribution. Each bin has an equal number of signal events. Lower signal percentile (sig. p) values correspond to higher network output values. Top: deep feed-forward network trained with BCE loss. Bottom: Event classifier transformer network trained with extreme + DisCo loss. Correlation represents the magnitude of difference in the shapes between the machine-learning bins. A lower correlation can be observed with the network trained with DisCo loss.
...and 2 more figures
