Reusable theory representations for colliders: a demonstrator SMEFT foundation model
Supratim Das Bakshi, T. J. Hobbs, Brandon Kriesten
TL;DR
This work addresses the challenge of exploring the high-dimensional SMEFT parameter space at collider scales by constructing a physics-aligned latent embedding learned from simulated neutral-current Drell–Yan spectra at linear order in $1/\Lambda^2$. A minimalist encoder trained with a supervised contrastive loss yields a two-dimensional latent space in which SMEFT-induced deformations map to interpretable directions and to clusters corresponding to Wilson-coefficient configurations. The embedding enables downstream tasks such as classification with uncertainty quantification, anomaly detection, and nearest-neighbor retrieval, offering a scalable, transferable representation that complements traditional global fits. Although demonstrated only at leading order, with simplified uncertainties, and for a single process, the approach provides a foundation for multi-process, higher-order analyses and for integration with global SMEFT analyses and potential agentic systems in collider phenomenology.
Abstract
We develop a demonstrator foundation model for collider-scale explorations of the Standard Model Effective Field Theory (SMEFT), constructed from contrastive representations of theoretically simulated neutral-current Drell–Yan cross sections. Using a controlled sampling of the Warsaw-basis dimension-6 Wilson-coefficient space at $O(\Lambda^{-2})$, we generate a corpus of high-resolution differential distributions in $m_{\ell\ell}$ and $p_{T}$, augmented by physics-motivated Monte Carlo replicas with correlated uncertainties. A minimally parameterized encoder network is trained with a supervised contrastive loss to produce a low-dimensional latent manifold on which SMEFT-induced deformations of the Drell–Yan spectrum acquire a well-defined geometric structure. We analyze the resulting embedding and demonstrate that (i) latent directions correlate with characteristic SMEFT shape distortions, including energy-growing four-fermion contributions and electroweak vertex corrections; (ii) clusters in the embedding correspond to families of Wilson-coefficient configurations with similar phenomenological impact; and (iii) the learned representation supports downstream tasks such as classification with uncertainty quantification, anomaly detection, and nearest-neighbor retrieval. While restricted to leading-order SMEFT and simplified uncertainty modeling, this study provides a first step toward a reusable, physics-aligned foundational representation for the theory of New-Physics searches at high-energy colliders. We outline extensions toward a complete global analysis, including multi-process training corpora, higher-order corrections, and multi-objective pretraining.
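To make the training objective named in the abstract concrete, the following is a minimal NumPy sketch of a supervised contrastive loss in its commonly used form (cosine similarity with a temperature, averaged over same-label positives). The function name `supcon_loss`, the temperature value, and the use of integer labels to tag Wilson-coefficient classes are illustrative assumptions; the abstract does not specify the exact loss variant or hyperparameters used by the authors.

```python
import numpy as np

def supcon_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss for one batch of embeddings.

    z           : (n, d) array of encoder outputs (hypothetical latent vectors)
    labels      : (n,) array of class labels (assumed here to tag SMEFT scenarios)
    temperature : similarity scale; 0.1 is an illustrative choice, not from the paper
    """
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    # L2-normalize so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)

    # Log-softmax of each anchor's similarities over all *other* samples.
    sim_max = sim.max(axis=1, keepdims=True)  # subtract row max for stability
    exp_sim = np.exp(sim - sim_max) * not_self
    log_prob = sim - sim_max - np.log(exp_sim.sum(axis=1, keepdims=True))

    # Positives: other samples in the batch that share the anchor's label.
    pos = (labels[:, None] == labels[None, :]) & not_self
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0  # anchors without any positive are skipped

    mean_log_prob_pos = (log_prob * pos).sum(axis=1)[valid] / n_pos[valid]
    return float(-mean_log_prob_pos.mean())
```

In this form the loss is small when embeddings that share a label lie close together and far from the rest of the batch, which is the property that would give rise to the label-aligned clusters described above. A two-dimensional `z` matches the latent dimension quoted in the TL;DR, but the function works for any embedding dimension.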
