Anomaly preserving contrastive neural embeddings for end-to-end model-independent searches at the LHC

Kyle Metzger, Lana Xu, Mia Sodini, Thea K. Arrestad, Katya Govorkova, Gaia Grosso, Philip Harris

TL;DR

The paper addresses anomaly detection at the LHC by learning compact, anomaly-preserving event representations through contrastive neural embeddings. It compares supervised and self-supervised contrastive objectives for both MLP and Transformer encoders, evaluating their effectiveness as inputs to signal-agnostic statistical tests. It finds that supervised contrastive learning delivers the strongest gains across diverse backgrounds and unseen signals, with Transformer architectures offering advantages for complex patterns; applied to a Delphes ADC2021 dataset and a challenging black-box test, the approach demonstrates substantial improvements in discovery power and feasibility for end-to-end, model-independent searches at the LHC.

Abstract

Anomaly detection - identifying deviations from Standard Model predictions - is a key challenge at the Large Hadron Collider due to the size and complexity of its datasets. This is typically addressed by transforming high-dimensional detector data into lower-dimensional, physically meaningful features. We tackle feature extraction for anomaly detection by learning powerful low-dimensional representations via contrastive neural embeddings. This approach preserves potential anomalies indicative of new physics and enables rare signal extraction using novel machine learning-based statistical methods for signal-independent hypothesis testing. We compare supervised and self-supervised contrastive learning methods, for both MLP- and Transformer-based neural embeddings, trained on the kinematic observables of physics objects in LHC collision events. The learned embeddings serve as input representations for signal-agnostic statistical detection methods in inclusive final states. We achieve significant improvement in discovery power for both rare new physics signals and rare Standard Model processes across diverse final states, demonstrating the applicability of the approach to efficiently searching for diverse signals simultaneously. We study the impact of architectural choices, contrastive loss formulations, supervision levels, and embedding dimensionality on anomaly detection performance. We show that the optimal representation for background classification does not always maximize sensitivity to new physics signals, revealing an inherent trade-off between background structure preservation and anomaly enhancement. We demonstrate that combining compression with domain knowledge for label encoding produces the most effective data representation for statistical discovery of anomalies.
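
To make the supervised contrastive objective mentioned in the abstract more concrete, here is a minimal sketch of a supervised SimCLR-style loss. It assumes the commonly used supervised contrastive form, in which events sharing a background label are treated as positive pairs. The function names, the temperature value, and the toy inputs are illustrative assumptions, not taken from the paper, and the paper's exact formulation may differ.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive.

    embeddings: list of equal-length, non-zero float vectors (one per event).
    labels: list of class labels (e.g. background process IDs), same length.
    temperature: assumed softmax temperature; 0.1 is an illustrative default.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, n_anchors = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity of anchor i to every event.
        sims = [sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n)]
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        # Log-sum-exp over all non-anchor events, shifted for numerical stability.
        shift = max(sims[j] for j in others)
        log_denom = shift + math.log(sum(math.exp(sims[j] - shift) for j in others))
        # Mean negative log-probability of picking a positive for this anchor.
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


# Toy check: two tight clusters in 2D.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supcon_loss(toy, [0, 0, 1, 1])    # labels follow the clusters
loss_mismatched = supcon_loss(toy, [0, 1, 0, 1])  # labels cut across the clusters
```

In this toy case the loss is lower when the labels follow the geometric clusters, which is the pressure that pulls same-class events together in the embedding space.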

Paper Structure

This paper contains 15 sections, 13 equations, 11 figures, 7 tables.

Figures (11)

  • Figure I: MLP-based neural network architecture used for training the supervised and self-supervised embeddings with SimCLR. Illustration made with Ref. drawio.
  • Figure II: Linear evaluation accuracy on background classes for the MLP- and Transformer-based models trained with supervision and a SimCLR loss (dark and light purple, respectively), the Transformer-based model trained with supervision and a VICReg loss using either balanced background classes (dark green) or background classes weighted to match the composition expected in data (light green), and the MLP-based model trained self-supervised with a SimCLR loss using 50% random masking (light blue), physics-inspired augmentations from Ref. Dillon:2021gag (medium blue), or physics- and anomaly-inspired augmentations from Ref. Dillon:2021gag (dark blue).
  • Figure III: The median observed $Z$-score as a function of the feature embedding size after injecting a signal into the SM background pseudo-dataset, corresponding to $0.5\%$ of the total integral. Results are shown for the $LQ \rightarrow \tau b$ (upper left), $A \rightarrow 4\ell$ (upper right), $H^{\pm} \rightarrow \tau \nu$ (bottom left), and $H \rightarrow \tau \tau$ (bottom right) signals. The median $Z$-score is presented for different embedding models: the MLP- and Transformer-based models trained with supervision and SimCLR loss (dark and light purple, respectively), the Transformer-based model trained with supervision and VICReg loss using either balanced background classes (dark green) or background classes weighted to match the composition expected in data (light green), and the MLP-based model trained self-supervised with a SimCLR loss using 50% random masking (light blue), physics-inspired augmentations from Ref. Dillon:2021gag (medium blue), or physics- and anomaly-inspired augmentations from Ref. Dillon:2021gag (dark blue). Additionally, we compare these results with three baseline embeddings: the 57D source (black circles), the 6D leading object $p_T$ (black triangles), and the 6D VAE (black crosses). The typical $3\sigma$ level for evidence and $5\sigma$ level for discovery are shown as dashed lines. Where empirical $Z$-scores cannot be computed ($Z>3\sigma$) we rely on the asymptotic formula.
  • Figure IV: Anomaly detection applied to the $A \rightarrow 4\ell$ signal benchmark. The outcomes of the NPLM test statistic for 300 toy experiments with $0.5\%$ signal injection (green histogram) are compared to those of 1000 experiments in the absence of signal (purple histogram), which represent the empirical null hypothesis. The four panels report the results for different data embeddings: the original 57-dimensional representation (top left); the 6-dimensional representation given by the six highest $p_T$ within the 19 objects in the event (top right); the 6-dimensional neural embedding given by the variational autoencoder (bottom left); and the 4-dimensional neural embedding given by the Transformer-based architecture trained with supervised SimCLR loss (bottom right). In each panel we report the power of the test at the 1, 2, and 3 $\sigma$ levels of discovery, obtained from the empirical distribution or from the asymptotic $\chi^2$ shown as a solid blue line.
  • Figure V: UMAP visualization of a "black box" batch. We inspect the most significant batch (split 9) by means of a two-dimensional UMAP, where the color code represents the sigmoid-activated score output by NPLM. The plot only shows data with score greater than 0.5. Data points are ordered by score so that higher-score data points are on top and always visible.
  • ...and 6 more figures
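
The $Z$-scores quoted in the captions of Figures III and IV compare test-statistic values from signal-injected toys against an empirical null distribution. The sketch below illustrates one way such a number could be computed, assuming a one-sided empirical p-value converted to a Gaussian quantile. The function names are hypothetical and the paper's exact procedure may differ. With 1000 null toys the smallest non-zero p-value is 1/1000, which corresponds to $Z \approx 3.09$; this is consistent with the captions' fallback to the asymptotic formula above roughly $3\sigma$.

```python
from statistics import NormalDist, median


def empirical_z(t_obs, null_toys):
    """One-sided Z-score of an observed test statistic against null toys.

    Returns None when no null toy reaches t_obs: the empirical p-value is then
    unresolved and an asymptotic formula would be needed instead (not shown).
    """
    n_exceed = sum(1 for t in null_toys if t >= t_obs)
    if n_exceed == 0:
        return None
    if n_exceed == len(null_toys):
        return float("-inf")  # p-value of 1: observation below every null toy
    p_value = n_exceed / len(null_toys)
    return NormalDist().inv_cdf(1.0 - p_value)


def median_z(signal_toys, null_toys):
    """Z-score of the median signal-injected toy.

    Because the p-value is monotonic in the test statistic, this equals the
    median of the per-toy Z-scores whenever those are all resolved.
    """
    return empirical_z(median(signal_toys), null_toys)


# Illustrative stand-in for a null distribution: 1000 evenly spaced values.
null = list(range(1000))
z_mid = empirical_z(500, null)          # half of the null toys are >= 500
z_tail = empirical_z(977, null)         # 23 of the 1000 null toys are >= 977
z_unresolved = empirical_z(2000, null)  # beyond every null toy
```

Returning `None` for the unresolved case keeps the limitation explicit rather than silently reporting an infinite significance.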