MACK: Mismodeling Addressed with Contrastive Knowledge
Liam Rankin Sheldon, Dylan Sheldon Rankin, Philip Harris
TL;DR
MACK addresses mismodeling between simulated and real data in high-energy physics with a contrastive learning framework. It trains a siamese network, consisting of a featurizer and a projector, under a VICReg loss on paired nominal (simulation) and alternate (data-like) samples, with positive pairs formed using the Energy Mover's Distance. A downstream classifier is then trained on the nominal representations. The approach reduces the sensitivity of model performance to dataset differences across two jet-tagging tasks (realistic Z′→qq̄ vs QCD, and JetNet), though there is a trade-off with nominal peak performance that can be mitigated via controlled fine-tuning. The results suggest MACK yields more stable models suitable for robust analyses and potentially broader applications beyond jet tagging, such as anomaly detection and new physics searches.
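To make the training objective concrete, the following is a minimal NumPy sketch of a VICReg-style loss applied to projected embeddings of paired nominal and alternate samples. It is an illustration under stated assumptions, not the authors' implementation: the function names (`pair_by_distance`, `vicreg_loss`), the nearest-neighbour pairing rule, and the loss weights are all hypothetical choices, and the distance matrix is assumed to have been computed elsewhere (for example with the Energy Mover's Distance).

```python
import numpy as np


def pair_by_distance(dist):
    """Pick one alternate sample per nominal sample.

    Assumption: dist[i, j] is a precomputed distance (e.g. an Energy
    Mover's Distance) between nominal jet i and alternate jet j, and a
    positive pair is simply the nearest alternate jet. The source does
    not specify the exact matching rule.
    """
    return np.argmin(dist, axis=1)


def vicreg_loss(z_a, z_b, inv_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg-style loss for two (n, d) batches of paired embeddings.

    The weights are illustrative values, not taken from the source.
    """
    n, d = z_a.shape

    # Invariance: paired embeddings should coincide.
    invariance = np.mean((z_a - z_b) ** 2)

    def variance_term(z):
        # Hinge that pushes each embedding dimension's std up towards 1,
        # which discourages collapse to a constant vector.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def covariance_term(z):
        # Penalise off-diagonal covariance so dimensions decorrelate.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = variance_term(z_a) + variance_term(z_b)
    covariance = covariance_term(z_a) + covariance_term(z_b)
    return inv_w * invariance + var_w * variance + cov_w * covariance


# Usage with stand-in embeddings (random numbers, not real jets).
rng = np.random.default_rng(0)
z_nominal = rng.normal(size=(8, 4))
z_alternate = z_nominal + 0.1 * rng.normal(size=(8, 4))
loss = vicreg_loss(z_nominal, z_alternate)
```

In a full pipeline the embeddings would come from the featurizer and projector applied to each member of a pair, and the loss would be minimised with an autodiff framework; the NumPy version here only shows how the three terms are assembled.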
Abstract
The use of machine learning methods in high energy physics typically relies on large volumes of precise simulation for training. As machine learning models become more complex, they can become increasingly sensitive to differences between this simulation and the real data collected by experiments. We present a generic methodology based on contrastive learning that greatly mitigates this negative effect. Crucially, the method does not require prior knowledge of the specifics of the mismodeling. While we demonstrate the efficacy of this technique on the task of jet tagging at the Large Hadron Collider, it is applicable to a wide array of tasks both in and out of the field of high energy physics.
