Is Tokenization Needed for Masked Particle Modelling?

Matthew Leigh, Samuel Klein, François Charton, Tobias Golling, Lukas Heinrich, Michael Kagan, Inês Ochoa, Margarita Osadchy

TL;DR

This work tackles masked particle modeling (MPM) for unordered sets of jet constituents and questions whether tokenization is necessary for self-supervised pretraining. It introduces MPMv2, which adds architectural enhancements, expanded particle features, and multiple non-tokenized reconstruction tasks, along with a set-to-set flow-matching (SSFM) variant. Across a new jet-focused evaluation suite, MPMv2 and SSFM outperform the tokenization-based MPMv1 and untrained baselines on in-distribution, weakly supervised, and out-of-distribution tasks, including secondary vertex finding and heavy-track identification. The results suggest that tokenization-free self-supervised learning (SSL) can yield strong, generalizable foundation models for jet physics and may influence SSL approaches in related domains.

Abstract

In this work, we significantly enhance masked particle modeling (MPM), a self-supervised learning scheme for constructing highly expressive representations of unordered sets relevant to developing foundation models for high-energy physics. In MPM, a model is trained to recover the missing elements of a set, a learning objective that requires no labels and can be applied directly to experimental data. We achieve significant performance improvements over previous work on MPM by addressing inefficiencies in the implementation and incorporating a more powerful decoder. We compare several pre-training tasks and introduce new reconstruction methods that use conditional generative models without data tokenization or discretization. We show that these new methods outperform the tokenized learning objective from the original MPM on a new test bed for foundation models for jets, which includes a wide variety of downstream tasks relevant to jet physics, such as classification, secondary vertex finding, and track identification.
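The masking step behind this objective can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mask_set`, the 40% mask fraction, and the toy `(pt, eta, phi)` tuples are all assumptions made for illustration. It only shows how an unordered set is split into a visible subset for the encoder and a masked subset that serves as the reconstruction target.

```python
import random


def mask_set(constituents, mask_fraction=0.4, seed=None):
    """Split a set of constituents into visible and masked subsets.

    mask_fraction is an illustrative value, not one taken from the paper.
    At least one element is masked and at least one stays visible
    whenever the set has two or more elements.
    """
    rng = random.Random(seed)
    n = len(constituents)
    if n > 1:
        n_masked = min(n - 1, max(1, round(mask_fraction * n)))
    else:
        n_masked = 0
    masked_idx = set(rng.sample(range(n), n_masked))
    visible = [c for i, c in enumerate(constituents) if i not in masked_idx]
    masked = [c for i, c in enumerate(constituents) if i in masked_idx]
    return visible, masked, sorted(masked_idx)


# Toy jet: each constituent is a hypothetical (pt, eta, phi) tuple.
jet = [
    (52.1, 0.10, -0.20),
    (33.4, 0.05, 0.15),
    (12.9, -0.12, 0.02),
    (8.3, 0.30, -0.05),
    (4.7, -0.25, 0.22),
]

visible, masked, masked_idx = mask_set(jet, mask_fraction=0.4, seed=0)

# An encoder would see only `visible`; a decoder would then be trained to
# reconstruct `masked` from the encoded visible set. No labels are involved.
print(len(visible), "visible,", len(masked), "masked")
```

Because the target is built from the input itself, the same procedure could in principle be run on unlabeled experimental data, which is the property the abstract highlights.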

Paper Structure (21 sections, 4 equations, 9 figures, 3 tables)


Figures (9)

  • Figure 1: A comparison of the original MPM encoder-decoder setup (left) and the new model configuration (right). The new model includes multiple reconstruction tasks, swaps the MLP decoder for a transformer, and only encodes the reduced set.
  • Figure 2: A schematic overview of the SSFM model.
  • Figure 3: The in-distribution performance of the fine-tuned models on the JetClass dataset. One panel shows the accuracy using standard supervised classification as a function of the dataset size. The other shows the significance improvement of the models trained in a CWoLa setting as a function of the number of signal samples in the dataset.
  • Figure 4: The performance of the fine-tuned models on the BTag dataset. One panel shows the supervised jet classifier accuracy versus the number of samples used in fine-tuning. A second shows the ARI score for the segmentation task versus the number of secondary vertices within each jet. A third shows the balanced accuracy for the track identification task as a function of the number of tracks in each jet.
  • Figure 5: The distributions of the particle features for the two datasets. The final plot shows the distributions of the particle types $x^\text{id}$ for the two datasets.
  • ...and 4 more figures
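
Two of the metrics named in the captions above can be sketched briefly. The code below uses the common textbook definitions: balanced accuracy as the mean per-class recall, and significance improvement as signal efficiency divided by the square root of background efficiency. That the paper uses exactly these forms is an assumption, and the function names and toy labels are hypothetical.

```python
import math


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes weigh as much as common ones."""
    recalls = []
    for c in sorted(set(y_true)):
        n_c = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / n_c)
    return sum(recalls) / len(recalls)


def significance_improvement(signal_eff, background_eff):
    """eps_S / sqrt(eps_B): the factor by which S / sqrt(B) changes after a cut."""
    return signal_eff / math.sqrt(background_eff)


# Toy track labels: six tracks of class 0 and two of class 1.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]

plain_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
sic = significance_improvement(signal_eff=0.5, background_eff=0.01)

print(f"accuracy={plain_acc:.3f} balanced={bal_acc:.3f} SIC={sic:.1f}")
```

On the toy labels the plain accuracy is 0.75 while the balanced accuracy is about 0.667, which shows why a balanced metric is the more informative choice when one track class is much rarer than the other.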