Optimal Transport Event Representation for Anomaly Detection

Aditya Bhargava, Tianji Cai, Benjamin Nachman

TL;DR

This work tackles resonant anomaly detection in collider data under a weak supervision framework and introduces a physics-guided intermediate representation based on optimal transport (OT). By linearizing the $2$-Wasserstein distance into the LinW$_2$ embedding and compressing it with PCA to a few informative components, the authors create OT-based features that augment standard high-level observables. In ultra-low signal regimes ($S/B$ around $0.5\%$), the OT$_k$ features reach large significance improvements (SI $>25$) and outperform both full low-level phase-space models and pretrained foundation models, while requiring only a modest number of PCA modes. The approach remains robust across datasets (R&D1 and R&D2) and illustrates the value of physically grounded intermediate representations as a bridge between engineered features and end-to-end ML. Code is available at the provided repository.

Abstract

We introduce optimal transport (OT) as a physics-based intermediate event representation for weakly supervised anomaly detection. With only $0.5\%$ injection of resonant signals in the LHC Olympics benchmark datasets, the OT-augmented feature set achieves nearly twice the significance improvement of standard high-level observables, while end-to-end deep learning on low-level four-momenta struggles in the low-signal regime. The gains persist across signal types and classifiers, underscoring the value of structured representations in machine learning for anomaly detection.
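
To make the LinW$_2$ idea concrete, the following is a minimal sketch of a linearized $2$-Wasserstein embedding, not the authors' implementation. It assumes each event is a small point cloud with the same number of uniformly weighted points as a fixed reference cloud, so the optimal transport plan reduces to a permutation that can be found by brute force. The function names (`optimal_assignment`, `linw2_embed`, `euclidean`) and the toy coordinates are hypothetical. A realistic analysis would use weighted particles and a proper OT solver, followed by the PCA compression step, which is not shown here.

```python
import itertools
import math


def optimal_assignment(ref, cloud):
    """Brute-force the permutation minimizing the total squared distance.

    Assumes both clouds have the same (small) number of 2D points with
    uniform weights, so the optimal transport plan is a permutation.
    """
    n = len(ref)
    best_perm, best_cost = None, math.inf
    for perm in itertools.permutations(range(n)):
        cost = sum(
            (ref[j][0] - cloud[perm[j]][0]) ** 2
            + (ref[j][1] - cloud[perm[j]][1]) ** 2
            for j in range(n)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm


def linw2_embed(ref, cloud):
    """Embed `cloud` as the weighted displacement field from `ref`.

    Each reference point j is mapped to its matched point T(x_j); the
    embedding stacks sqrt(1/n) * (T(x_j) - x_j) into one flat vector.
    """
    n = len(ref)
    perm = optimal_assignment(ref, cloud)
    scale = 1.0 / math.sqrt(n)
    vec = []
    for j in range(n):
        tx, ty = cloud[perm[j]]
        vec.extend([scale * (tx - ref[j][0]), scale * (ty - ref[j][1])])
    return vec


def euclidean(u, v):
    """Plain Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy reference cloud and two "events": shifted copies listed in shuffled order.
ref = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
event_a = [(1.5, 1.0), (0.5, 0.0), (1.5, 0.0), (0.5, 1.0)]  # ref shifted by (0.5, 0)
event_b = [(1.0, 1.5), (0.0, 0.5), (0.0, 1.5), (1.0, 0.5)]  # ref shifted by (0, 0.5)

emb_a = linw2_embed(ref, event_a)
emb_b = linw2_embed(ref, event_b)
dist_ab = euclidean(emb_a, emb_b)  # linearized estimate of W2(event_a, event_b)
```

The norm of each embedding equals the exact $2$-Wasserstein distance between the event and the reference, while the Euclidean distance between two embeddings only approximates the distance between the corresponding events. Because every event becomes a fixed-length vector, standard tools such as PCA can then be applied directly.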

Paper Structure

This paper contains 7 sections, 2 equations, 5 figures.

Figures (5)

  • Figure 1: Total variance explained by increasing numbers of PCA modes of the OT representations for 10k samples from the R&D1 dataset (green) and the R&D2 dataset (orange).
  • Figure 2: Maximum Significance Improvement (SI) for the R&D1 (left) and R&D2 (right) datasets using BDT ensembles as the classifier, with the signal injection level S/B varying from $0.2\%$ to $10\%$. The group of red curves represents increasing numbers of OT features added to standard high-level observables, i.e., $\{m_{J_1}, m_{J_2}, \tau_{21}^{J_1}, \tau_{21}^{J_2} \}$ for R&D1, and $\{m_{J_1}, m_{J_2}, \tau_{21}^{J_1}, \tau_{21}^{J_2}, \tau_{32}^{J_1}, \tau_{32}^{J_2} \}$ for R&D2. The dashed gray line in the left subplot shows the maximum SI values from Ref. Buhmann:2023acn using the full phase space as input to dedicated models, whereas the dashed green line shows the performance of the pretrained foundation model OmniLearn from Ref. omnilearn.
  • Figure 3: Significance Improvement (SI) curves for the R&D1 (left) and R&D2 (right) datasets at a representative signal injection level of S/B$=0.63\%$, using BDT ensembles as the classifier. The group of red curves represents increasing numbers of OT features added to standard high-level observables, with shaded bands indicating $1\sigma$ variations across the BDT ensembles. For R&D1, the dashed gray line shows the maximum SI values of the full phase space method from Ref. Buhmann:2023acn, and the dashed green line shows that of the pretrained foundation model OmniLearn from Ref. omnilearn.
  • Figure 4: Anomaly detection for the R&D1 dataset using MLP ensembles as the classifier. Left: The SI curves at the signal injection level S/B$=0.63\%$. Right: Maximum SI at S/B between $0.2\%$ and $10\%$. The group of red curves represents increasing numbers of OT features added to standard high-level observables $\{m_{J_1}, m_{J_2}, \tau_{21}^{J_1}, \tau_{21}^{J_2} \}$, with shaded bands denoting $1\sigma$ variations across the ensembles of trained models. The dashed gray line shows the performance of the full phase space method from Ref. Buhmann:2023acn, whereas the dashed green line shows that of the pretrained foundation model OmniLearn from Ref. omnilearn.
  • Figure 5: The SI curves for the R&D1 dataset using BDT ensembles as the classifier at S/B$=0.63\%$. Left: The input features are $\{\tau_{21}^{J_1}, \tau_{21}^{J_2} \}$ and different numbers of PCA modes for the OT representation (the group of red lines). Middle: The input features are $\{m_{J_1}, m_{J_2} \}$ and different numbers of PCA modes for the OT representation. Note that the middle subplot shares the same $y$-axis with the leftmost subplot. Right: The input features include only different numbers of PCA modes for the OT representation. Note that the rightmost subplot has its own $y$-axis to the right. Shaded bands denote $1\sigma$ variations across the ensembles of trained models.
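
The captions above report SI curves and maximum SI values without restating the definition. The sketch below assumes the definition commonly used in resonant anomaly detection studies, SI $= \epsilon_S / \sqrt{\epsilon_B}$, where $\epsilon_S$ and $\epsilon_B$ are the signal and background efficiencies at a given classifier threshold. The function names (`si_curve`, `max_si`) and the toy scores are hypothetical, and the code assumes all scores are distinct; it is an illustration rather than the evaluation code behind the figures.

```python
import math


def si_curve(scores, labels):
    """Return (signal_eff, background_eff, SI) for each threshold.

    `labels` holds 1 for signal and 0 for background. Thresholds are
    scanned from the highest score downward; points with zero background
    efficiency are skipped because SI is undefined there.
    """
    n_sig = sum(labels)
    n_bkg = len(labels) - n_sig
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    passed_sig = passed_bkg = 0
    curve = []
    for i in order:
        if labels[i] == 1:
            passed_sig += 1
        else:
            passed_bkg += 1
        if passed_bkg == 0:
            continue
        eff_s = passed_sig / n_sig
        eff_b = passed_bkg / n_bkg
        curve.append((eff_s, eff_b, eff_s / math.sqrt(eff_b)))
    return curve


def max_si(scores, labels):
    """Largest SI over all scanned thresholds."""
    return max(si for _, _, si in si_curve(scores, labels))


# Toy example: 3 signal and 5 background events with distinct scores.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]
toy_max_si = max_si(toy_scores, toy_labels)
```

With no selection at all, both efficiencies equal one and SI is exactly 1, so values above 1 indicate that the cut enhances the statistical significance of the signal. At very low background efficiency the estimate is dominated by a handful of events and becomes noisy.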