Re-Simulation-based Self-Supervised Learning for Pre-Training Foundation Models

Philip Harris; Michael Kagan; Jeffrey Krupa; Benedikt Maier; Nathaniel Woodward

TL;DR

RS3L introduces a re-simulation-based self-supervised learning framework that intervenes partway through the simulation of stochastic physics processes to generate diverse, physics-informed augmentations for contrastive learning. By mapping jets into a compact $8$-dimensional latent space trained with a SimCLR-style objective, RS3L pre-training yields robust representations that transfer well to both in-distribution and out-of-distribution jet tagging tasks, often matching or exceeding fully supervised baselines with less labeled data. The approach demonstrates improved robustness to domain shifts between simulators and real data, and the authors release the RS3L dataset publicly to spur further research. Overall, RS3L offers a scalable path toward foundation-model pre-training in science domains with complex, stochastic simulators.

Abstract

Self-Supervised Learning (SSL) is at the core of training modern large machine learning models, providing a scheme for learning powerful representations that can be used in a variety of downstream tasks. However, SSL strategies must be adapted to the type of training data and downstream tasks required. We propose RS3L ("Re-simulation-based self-supervised representation learning"), a novel simulation-based SSL strategy that employs a method of re-simulation to drive data augmentation for contrastive learning in the physical sciences, particularly in fields that rely on stochastic simulators. By intervening in the middle of the simulation process and re-running simulation components downstream of the intervention, we generate multiple realizations of an event, thus producing a set of augmentations covering all physics-driven variations available in the simulator. Using experiments from high-energy physics, we explore how this strategy may enable the development of a foundation model; we show how RS3L pre-training enables powerful performance in downstream tasks such as discrimination of a variety of objects and uncertainty mitigation. In addition to our results, we make the RS3L dataset publicly available for further studies on how to improve SSL strategies.
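The re-simulation idea described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the two-stage "simulator", the function names (`hard_process`, `shower_and_detect`, `resimulate`), and all numerical parameters are hypothetical stand-ins, not the simulation chain used in the paper. The point is only the pattern: the upstream stage is run once and frozen, and the stochastic downstream stage is re-run with fresh random seeds to produce several realizations of the same event.

```python
import random

def hard_process(rng):
    """Toy upstream stage (hypothetical): draw the parton energies of one event."""
    n_partons = rng.randint(2, 4)
    return [rng.uniform(20.0, 200.0) for _ in range(n_partons)]

def shower_and_detect(partons, rng, split_prob=0.5, smear=0.05):
    """Toy downstream stage (hypothetical): random splitting plus energy smearing."""
    particles = []
    for energy in partons:
        if rng.random() < split_prob:
            # Split one parton into two pieces that share its energy.
            frac = rng.uniform(0.1, 0.9)
            pieces = [energy * frac, energy * (1.0 - frac)]
        else:
            pieces = [energy]
        # Gaussian smearing stands in for detector resolution.
        particles.extend(e * rng.gauss(1.0, smear) for e in pieces)
    return particles

def resimulate(event_seed, n_views=2):
    """Freeze the upstream output, then re-run the downstream stage per view."""
    partons = hard_process(random.Random(event_seed))  # intervention point
    views = [
        shower_and_detect(partons, random.Random(1_000_003 * event_seed + k + 1))
        for k in range(n_views)
    ]
    return partons, views

partons, views = resimulate(event_seed=7, n_views=3)
print(len(partons), [len(v) for v in views])
```

In this sketch, views that come from the same `event_seed` would play the role of positive pairs, and views from different seeds would serve as negatives.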

Paper Structure (13 sections, 3 equations, 8 figures, 5 tables)

Introduction
Methods
The RS3L backbone
Data augmentations and training input
Network architecture
Fine-tuning and fully-supervised trainings
Results
Understanding the contrastive space
Fine-tuning on top of the RS3L backbone
In-distribution classification and robustness
Out-of-distribution classification task
Outlook
Acknowledgments

Figures (8)

Figure 1: Illustration of the RS3L setup, including downstream re-simulation, sampling, graph computation, and the construction of positive and negative pairs. These are then used in a contrastive loss function aiming to align positive pairs and push negative pairs apart.
Figure 2: Metrics pertaining to the convergence of the RS3L training as a function of epoch. Top panel: the average cosine similarity between the positive pairs (anchor jet and augmented jet). Bottom panel: the cosine similarity between the average Higgs vector and the average QCD vector. The variation, indicated by the error bands, is computed over three RS3L trainings.
Figure 3: (Left) One of eight features derived in the RS3L pre-training. (Right) Jet substructure variable $N_2$. The main (upper) panel shows the distributions for the nominal parton shower scenario. The ratio panels show the difference between the respective varied distributions and the nominal distribution. For FSR, the up and down variations form a band around the nominal distribution.
Figure 4: Corner plots for the eight outputs of RS3L, split up into Higgs boson (reds) and QCD jets (blues). Only small correlations are observed among feature pairs, as indicated by the Pearson correlation coefficients provided in each subplot.
Figure 5: 2D visualization of the 8D RS3L space, derived via t-SNE dimensionality reduction. Top: A good class separation is seen between Higgs jets and QCD (quark and gluon) jets. Bottom left: Jets shown by parton shower model for Pythia8 and Herwig7 for an RS3L space trained with Herwig7 augmentations. Bottom right: The same for an RS3L space trained without Herwig7 augmentations. The congruence of the different parton shower models is visibly worse in the right-hand scenario.
...and 3 more figures
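
The contrastive objective sketched in the Figure 1 caption and the positive-pair cosine similarity tracked in Figure 2 can be made concrete with a small example. This is a minimal sketch of the standard SimCLR NT-Xent loss, assuming that form is close to the "SimCLR-style objective" mentioned in the TL;DR; the temperature value, the function names, and the toy 2D embeddings are illustrative assumptions, not taken from the paper. It uses only the Python standard library so it runs as-is.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(anchors, positives, temperature=0.1):
    """NT-Xent loss averaged over all 2N views (temperature is an assumed value).

    anchors[i] and positives[i] are two views of the same jet (a positive pair);
    every other view in the batch acts as a negative.
    """
    views = list(anchors) + list(positives)
    n = len(anchors)
    total = 0.0
    for i, z_i in enumerate(views):
        j = (i + n) % (2 * n)  # index of the positive partner of view i
        logits = [cosine(z_i, z_k) / temperature
                  for k, z_k in enumerate(views) if k != i]
        pos_logit = cosine(z_i, views[j]) / temperature
        # Numerically stable log-sum-exp over all non-self similarities.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)

def mean_positive_similarity(anchors, positives):
    """Average cosine similarity of positive pairs (the kind of metric in Figure 2, top)."""
    return sum(cosine(a, p) for a, p in zip(anchors, positives)) / len(anchors)

aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each view sits close to its positive partner and far from the other views, and large when the partners are swapped, which is the "align positive pairs, push negative pairs apart" behavior the caption describes.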
