OmniJet-$\alpha$: The first cross-task foundation model for particle physics

Joschka Birk, Anna Hallin, Gregor Kasieczka

TL;DR

OmniJet-$\alpha$ is a cross-task foundation-model framework for particle physics: it learns discrete representations of jet constituents with a VQ-VAE and models the resulting token sequences with an autoregressive transformer. It demonstrates both unsupervised jet generation and supervised jet tagging, showing that a generatively pre-trained backbone improves classification when labeled data is limited. The work introduces token-quality metrics to guide tokenization and shows that large, conditional token vocabularies yield higher fidelity than other tokenization schemes. This cross-task transfer marks a first step toward reusable foundation models in high-energy physics, with potential reductions in data and compute requirements for future analyses.

Abstract

Foundation models are multi-dataset and multi-task machine learning methods that, once pre-trained, can be fine-tuned for a large variety of downstream applications. The successful development of such general-purpose models for physics data would be a major breakthrough, as they could improve the achievable physics performance while drastically reducing the required amount of training time and data. We report significant progress on this challenge on several fronts. First, a comprehensive set of evaluation methods is introduced to judge the quality of an encoding from physics data into a representation suitable for the autoregressive generation of particle jets with transformer architectures (the common backbone of foundation models). These measures motivate the choice of a higher-fidelity tokenization compared to previous works. Finally, we demonstrate transfer learning between an unsupervised problem (jet generation) and a classic supervised task (jet tagging) with our new OmniJet-$\alpha$ model. This is the first successful transfer between two different and actively studied classes of tasks and constitutes a major step in the building of foundation models for particle physics.
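
To make the idea of judging an encoding concrete, the sketch below maps each jet constituent to the index of its nearest codebook entry, reconstructs the jet from those indices, and compares the jet mass before and after (the quantity shown in Figure 4). It is an illustration only, not the authors' method: the codebook and jet values are invented, the learned VQ-VAE encoder and decoder are omitted so that quantization happens directly in feature space, and constituents are assumed to be massless.

```python
import math


def quantize(features, codebook):
    """Return the index of the codebook entry closest to `features`."""
    distances = [
        sum((x - c) ** 2 for x, c in zip(features, code)) for code in codebook
    ]
    return distances.index(min(distances))


def jet_mass(constituents):
    """Invariant mass of a jet built from massless (pt, eta, phi) constituents."""
    e = px = py = pz = 0.0
    for pt, eta, phi in constituents:
        px += pt * math.cos(phi)
        py += pt * math.sin(phi)
        pz += pt * math.sinh(eta)
        e += pt * math.cosh(eta)
    m2 = e * e - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))  # clip tiny negative values from rounding


# Hypothetical toy codebook and jet; each entry is (pt, eta_rel, phi_rel).
codebook = [(1.0, 0.0, 0.0), (5.0, 0.1, -0.1), (20.0, -0.2, 0.2)]
jet = [(18.5, -0.15, 0.22), (4.2, 0.12, -0.05), (0.9, 0.01, 0.02)]

tokens = [quantize(c, codebook) for c in jet]   # discrete representation
reconstructed = [codebook[t] for t in tokens]   # back to physical space
mass_shift = jet_mass(reconstructed) - jet_mass(jet)

print(tokens)
print(mass_shift)
```

With only three codebook entries the reconstruction is coarse. A larger codebook places entries closer to each constituent, which is the intuition behind preferring a higher-fidelity tokenization.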

Paper Structure

This paper contains 18 sections, 4 equations, and 18 figures.

Figures (18)

  • Figure 1: Schematics of the different steps (tokenization, generation, classification) in the OmniJet-$\alpha$ model.
  • Figure 2: Architecture of the transformer backbone component of OmniJet-$\alpha$. Data encoded by the VQ-VAE is fed through an embedding layer before it reaches the main part of the model, which is based on the transformer decoder. The output of the transformer decoder blocks is passed to a task-specific head for either generation or classification. Note that during inference of the generative model, the model does not receive complete token sequences, but only the start token. The model then autoregressively generates the rest of the sequence, updating its input as it progresses, as described in the text.
  • Figure 3: Visualization of the reconstructed tokens in physical space (i.e. $p_\text{T}$, $\eta^{\text{rel}}$, $\phi^{\text{rel}}$) for different tokenization approaches and codebook sizes. Each figure label indicates the codebook size and the tokenization approach. The unconditional tokenization and the binning approach each have only one reconstruction per token, independent of the other tokens in the jet. To visualize the reconstruction spread of the conditional tokens, each token is reconstructed 500 times, each time conditioned on 50 randomly selected tokens from the codebook. Each colored blob corresponds to the reconstructions obtained for one token.
  • Figure 4: (left) Jet mass distribution for all ten jet types combined. (center) Difference between the mass after tokenization and the initial mass for $t\to bqq'$ jets. (right) Difference between the $\tau_{32}$ of the initial jets and the reconstructed jets for $t\to bqq'$ jets.
  • Figure 5: Token quality evaluation using a multi-class classifier. The classifier accuracy is shown for different codebook sizes and different classifier architectures (purple and green). The classifiers are also trained on the original constituents, showing an upper limit for the achievable accuracy. The reconstructed constituents are obtained using the conditional tokenization. The reported values and the uncertainty band correspond to the mean and standard deviation over 5 trainings with different random seeds for the randomly initialized weights of the classifier.
  • ...and 13 more figures
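
The inference procedure described in the Figure 2 caption (start from only the start token, then generate the rest of the sequence autoregressively) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_next_token_probs` is a hypothetical stand-in for the transformer backbone plus generative head, and the token ids `START` and `STOP`, the vocabulary size, and the maximum length are assumptions chosen for the example.

```python
import random

START, STOP = 0, 1  # hypothetical ids for the start and stop tokens


def toy_next_token_probs(sequence, vocab_size):
    """Hypothetical stand-in for the backbone plus generative head.

    Returns one probability per token id. The stop probability grows with
    the sequence length, and the start token is never emitted again.
    """
    p_stop = min(1.0, len(sequence) / 10)
    n_regular = vocab_size - 2  # all ids except START and STOP
    probs = [(1.0 - p_stop) / n_regular] * vocab_size
    probs[START] = 0.0
    probs[STOP] = p_stop
    return probs


def generate(next_token_probs, vocab_size, max_len, rng):
    """Sample one token sequence, starting from only the start token."""
    sequence = [START]
    while len(sequence) < max_len:
        # The model is re-evaluated on everything generated so far.
        probs = next_token_probs(sequence, vocab_size)
        token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        sequence.append(token)
        if token == STOP:
            break
    return sequence


rng = random.Random(0)
jet_tokens = generate(toy_next_token_probs, vocab_size=6, max_len=16, rng=rng)
print(jet_tokens)
```

In a real setup, the sampled tokens would then be decoded back to constituent features by the tokenizer's decoder. That step is omitted here.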