SURFing to the Fundamental Limit of Jet Tagging
Ian Pang, Darius A. Faroughy, David Shih, Ranit Das, Gregor Kasieczka
TL;DR
This work asks whether jet tagging has reached its fundamental Neyman-Pearson (NP) limit by comparing autoregressive GPT-style jet generators with a continuous, permutation-equivariant flow-matching surrogate (EPiC-FM). It introduces the SURF framework, which enables exact NP tests by training the target model on surrogate-generated data, yielding ground-truth NP curves for both models. The key finding is that GPT-based likelihoods overstate top vs. QCD separation due to artifacts and overfitting, while EPiC-FM suggests the true limit is not far from current state-of-the-art taggers. The study emphasizes that valid surrogate references and careful evaluation are needed to avoid misjudging the ultimate discriminative power in jet tagging, and presents SURF as a general approach for such assessments.
Abstract
Beyond the practical goal of improving search and measurement sensitivity through better jet tagging algorithms, there is a deeper question: what are their upper performance limits? Generative surrogate models with learned likelihood functions offer a new way to address this problem, provided the surrogate correctly captures the underlying data distribution. In this work, we introduce the SUrrogate ReFerence (SURF) method, a new approach to validating generative models. This framework enables exact Neyman-Pearson tests by training the target model on samples from another tractable surrogate, which is itself trained on real data. We argue that the EPiC-FM generative model is a valid surrogate reference for JetClass jets and apply SURF to show that modern jet taggers may already be operating close to the true statistical limit. By contrast, we find that autoregressive GPT models unphysically exaggerate the top vs. QCD separation power encoded in the surrogate reference, implying that they give a misleading picture of the fundamental limit.
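The logic of a surrogate-reference test can be illustrated with a deliberately simple toy, which is a sketch under stated assumptions and not the implementation used in this work. Two one-dimensional Gaussians stand in for a tractable surrogate with exactly known "top" and "QCD" densities, and a pair of Gaussians fitted to surrogate samples stands in for the target model. All names (SURR, sample, exact_llr, target_llr) and all parameter values are hypothetical. Because the surrogate densities are known, its log-likelihood ratio gives the exact Neyman-Pearson optimum on surrogate samples, and the target model's likelihood ratio can be compared against it.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def gauss_logpdf(x, mu, sigma):
    """Log-density of a 1D Gaussian."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def auc(sig_scores, bkg_scores):
    """AUC from the Mann-Whitney rank statistic (assumes no tied scores)."""
    scores = np.concatenate([sig_scores, bkg_scores])
    ranks = scores.argsort().argsort() + 1  # ranks 1..N
    n_s, n_b = len(sig_scores), len(bkg_scores)
    u = ranks[:n_s].sum() - n_s * (n_s + 1) / 2
    return u / (n_s * n_b)

# Toy surrogate reference (assumption): (mean, std) per class, known exactly.
SURR = {"top": (1.0, 1.0), "qcd": (0.0, 1.0)}

def sample(label, n):
    """Draw n samples from the toy surrogate for the given class."""
    mu, sigma = SURR[label]
    return rng.normal(mu, sigma, size=n)

def exact_llr(x):
    """Exact log-likelihood ratio of the surrogate: the NP-optimal statistic."""
    return gauss_logpdf(x, *SURR["top"]) - gauss_logpdf(x, *SURR["qcd"])

# Step 1: "train" the target model on surrogate samples.
# Here training is just a maximum-likelihood Gaussian fit per class.
train = {k: sample(k, 20_000) for k in SURR}
target = {k: (v.mean(), v.std()) for k, v in train.items()}

def target_llr(x):
    """Log-likelihood ratio according to the fitted target model."""
    return gauss_logpdf(x, *target["top"]) - gauss_logpdf(x, *target["qcd"])

# Step 2: evaluate both statistics on independent surrogate samples.
test = {k: sample(k, 20_000) for k in SURR}
auc_exact = auc(exact_llr(test["top"]), exact_llr(test["qcd"]))
auc_target = auc(target_llr(test["top"]), target_llr(test["qcd"]))

# Closed-form AUC for unit-variance Gaussians one unit apart: Phi(1/sqrt(2)).
auc_theory = 0.5 * (1.0 + math.erf(0.5))

print(f"exact NP AUC: {auc_exact:.4f}")
print(f"target AUC:   {auc_target:.4f}")
print(f"theory AUC:   {auc_theory:.4f}")
```

In this toy the target model is well specified, so its AUC should track the exact value closely. Note that the toy only tests classification on surrogate samples. A target model that reports far more separation on its own generated samples than the exact statistic allows would be showing the kind of discrepancy described above for the GPT models.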
