Event Tokenization and Next-Token Prediction for Anomaly Detection at the Large Hadron Collider
Ambre Visive, Polina Moskvitina, Clara Nellist, Roberto Ruiz de Austri, Sascha Caron
TL;DR
The paper proposes using encoder-based, LLM-like networks, trained only on background collider events, for unsupervised anomaly detection via masked-token reconstruction. Collider events are tokenized into sequences and modeled by a two-layer Transformer; the reconstruction score of each event measures how far it deviates from the learned background distribution. On a four-top-quark production benchmark, the method reaches a ROC-AUC of about 0.67, which is competitive with certain unsupervised approaches but does not surpass the best DDD-based methods. The approach offers a flexible, data-driven route to revealing subtle discrepancies in LHC data and could support searches for new physics without explicit signal modeling.
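To make the quoted ROC-AUC figure concrete, the sketch below computes the metric directly from per-event anomaly scores, using its interpretation as the probability that a randomly chosen signal event receives a higher score than a randomly chosen background event. The function name `roc_auc` and the score values are illustrative assumptions, not taken from the paper.

```python
def roc_auc(background_scores, signal_scores):
    """ROC-AUC as the pairwise win rate of signal over background.

    Each (signal, background) pair counts 1 if the signal event has the
    higher anomaly score, 0.5 on a tie, and 0 otherwise.
    """
    wins = 0.0
    for s in signal_scores:
        for b in background_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(signal_scores) * len(background_scores))


# Made-up anomaly scores, for illustration only.
background = [0.10, 0.40, 0.35, 0.80]
signal = [0.90, 0.30, 0.50, 0.45]

auc = roc_auc(background, signal)
print(auc)  # 0.6875: signal wins 11 of the 16 pairs
```

A value of 0.5 corresponds to scores that do not separate signal from background at all, and 1.0 to perfect separation, so a value near 0.67 indicates a modest but real ranking power.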
Abstract
We propose a novel use of Large Language Models (LLMs) as unsupervised anomaly detectors in particle physics. Using lightweight LLM-like networks with encoder-based architectures trained to reconstruct background events via masked-token prediction, our method identifies anomalies through deviations in reconstruction performance, without prior knowledge of signal characteristics. Applied to searches for simultaneous four-top-quark production, this token-based approach shows competitive performance against established unsupervised methods and effectively captures subtle discrepancies in collider data, suggesting a promising direction for model-independent searches for new physics.
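As a rough illustration of the idea described above, the following sketch tokenizes toy events and scores them by how poorly each token is predicted when it is hidden. Everything here is an assumption made for illustration: the binning edges, the token names, and in particular `PositionalFrequencyModel`, which is a simple per-position frequency table standing in for the encoder network. Unlike a real masked-token model, it ignores the unmasked context entirely.

```python
import math
from collections import Counter

# Hypothetical transverse-momentum bin edges in GeV (not the paper's scheme).
PT_EDGES = [0.0, 50.0, 100.0, 200.0, 400.0]


def tokenize_object(kind, pt):
    """Map one reconstructed object to a discrete token such as 'jet_pt2'."""
    bin_index = max(sum(pt >= edge for edge in PT_EDGES) - 1, 0)
    return f"{kind}_pt{bin_index}"


def tokenize_event(objects, length=4, pad="PAD"):
    """Turn a list of (kind, pt) pairs into a fixed-length token sequence."""
    tokens = [tokenize_object(kind, pt) for kind, pt in objects][:length]
    return tokens + [pad] * (length - len(tokens))


class PositionalFrequencyModel:
    """Toy stand-in for the encoder: P(token | position) from background counts."""

    def __init__(self, background, vocab_size, alpha=1.0):
        self.alpha = alpha              # Laplace smoothing strength
        self.vocab_size = vocab_size    # assumed size of the token vocabulary
        self.total = len(background)
        self.counts = [Counter(column) for column in zip(*background)]

    def prob(self, position, token):
        count = self.counts[position][token]
        return (count + self.alpha) / (self.total + self.alpha * self.vocab_size)


def anomaly_score(model, tokens):
    """Mean negative log-probability of each token when it is masked in turn."""
    nll = [-math.log(model.prob(i, tok)) for i, tok in enumerate(tokens)]
    return sum(nll) / len(nll)


# Train on background-like toy events only.
background = [tokenize_event(e) for e in [
    [("jet", 60.0), ("jet", 40.0)],
    [("jet", 70.0), ("jet", 30.0)],
    [("jet", 55.0), ("jet", 45.0), ("muon", 20.0)],
]]
model = PositionalFrequencyModel(background, vocab_size=16)

typical = tokenize_event([("jet", 65.0), ("jet", 35.0)])
unusual = tokenize_event([("jet", 450.0), ("jet", 300.0),
                          ("muon", 150.0), ("electron", 120.0)])

print(anomaly_score(model, typical) < anomaly_score(model, unusual))  # True
```

Events whose tokens were rarely or never seen in the background sample are reconstructed poorly and receive a higher score, which is the quantity one would threshold or rank to flag anomalies.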
