Foundation models for high-energy physics

Anna Hallin

TL;DR

The paper surveys the emergence of foundation models in high-energy physics, proposing a broad working definition that encompasses large-scale pretraining, latent representations, and downstream finetuning. It argues that collider data, being multimodal, abundant in simulations, and rich in downstream tasks, provides fertile ground for developing HEP-specific foundation models from scratch. A detailed exemplar, OmniJet-$\alpha$ (Birk et al.), demonstrates a cross-task model that tokenizes jet constituents with a VQ-VAE, is pretrained on unlabeled data via autoregressive generation, and is then finetuned for downstream classification, with applications across jets, calorimeter showers, and tau physics. The review also compares a spectrum of models (ParT, MPM, RS3L, OmniLearn, L-GATr, J-JEPA, Bumblebee, HEP-JEPA, etc.) across pretraining objectives, supervision level, and data representations, highlighting trade-offs between simulation dependence and task-specific tailoring. It concludes that while these models hold potential for improved physics outcomes and efficiency, their development requires substantial compute and collaborative effort, especially to scale in the HL-LHC era.

Abstract

The rise of foundation models -- large, pretrained machine learning models that can be finetuned to a variety of tasks -- has revolutionized the fields of natural language processing and computer vision. In high-energy physics, the question of whether these models can be implemented directly in physics research, or even built from scratch, tailored for particle physics data, has generated an increasing amount of attention. This review, which is the first on the topic of foundation models in high-energy physics, summarizes and discusses the research that has been published in the field so far.

Paper Structure

Table of Contents