SURFing to the Fundamental Limit of Jet Tagging

Ian Pang, Darius A. Faroughy, David Shih, Ranit Das, Gregor Kasieczka

TL;DR

This work asks whether jet tagging has reached its fundamental Neyman-Pearson (NP) limit by comparing autoregressive GPT-style jet generators with a continuous, permutation-equivariant flow-matching surrogate (EPiC-FM). It introduces the SURF framework, which performs exact NP tests by training the target model on surrogate-generated data, so that ground-truth NP curves are available for both models. The key finding is that GPT-based likelihoods overstate top vs. QCD separation due to artifacts and overfitting, while EPiC-FM suggests the true limit is not far from current state-of-the-art taggers. The study emphasizes that a valid surrogate reference and careful evaluation are needed to avoid misjudging the ultimate discriminative power in jet tagging, and presents SURF as a general approach for such assessments.

Abstract

Beyond the practical goal of improving search and measurement sensitivity through better jet tagging algorithms, there is a deeper question: what are their upper performance limits? Generative surrogate models with learned likelihood functions offer a new approach to this problem, provided the surrogate correctly captures the underlying data distribution. In this work, we introduce the SUrrogate ReFerence (SURF) method, a new approach to validating generative models. This framework enables exact Neyman-Pearson tests by training the target model on samples from another tractable surrogate, which is itself trained on real data. We argue that the EPiC-FM generative model is a valid surrogate reference for JetClass jets and apply SURF to show that modern jet taggers may already be operating close to the true statistical limit. By contrast, we find that autoregressive GPT models unphysically exaggerate top vs. QCD separation power encoded in the surrogate reference, implying that they are giving a misleading picture of the fundamental limit.
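
For orientation, the "exact Neyman-Pearson test" mentioned in the abstract can be written out as follows. This is only the textbook form of the lemma; the symbols $p_{\text{top}}$, $p_{\text{QCD}}$, $\Lambda$, $c$, $\varepsilon_S$ and $\varepsilon_B$ are notation assumed here and may not match the paper's own.

```latex
% Neyman-Pearson lemma, stated for top vs. QCD (notation assumed, not the paper's)
\Lambda(x) = \frac{p_{\text{top}}(x)}{p_{\text{QCD}}(x)},
\qquad \text{tag } x \text{ as top if } \Lambda(x) > c .

% Efficiencies of the cut and the rejection quoted in the figure captions
\varepsilon_S(c) = \Pr\!\left[\Lambda(x) > c \mid x \sim p_{\text{top}}\right],
\qquad
\varepsilon_B(c) = \Pr\!\left[\Lambda(x) > c \mid x \sim p_{\text{QCD}}\right],
\qquad
R_{50} = \frac{1}{\varepsilon_B(c_{50})} \quad \text{with } \varepsilon_S(c_{50}) = 0.5 .
```

Among all tests with the same $\varepsilon_B$, thresholding $\Lambda$ gives the largest $\varepsilon_S$. This is why a sample drawn from a model with tractable densities has a known optimal ROC curve, which is the property SURF relies on.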

Paper Structure

This paper contains 21 sections, 10 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Illustration of the SURF method. The conventional baseline approach is shown on the left and the SURF approach is shown on the right. The red lines denote interclass (e.g. top vs QCD) tests and the blue lines denote 2-sample tests. The solid lines correspond to NP-optimal tests, while dashed lines correspond to trained classifier tests (e.g. OmniLearn). Here we see the NP-optimal tests enabled by the SURF method.
  • Figure 2: ROC curves for top vs. QCD jet classification derived from the JetClass dataset and the generative models trained on it. Note that, although physical jets can contain more than 40 constituents, in this study each jet is represented only by its 40 hardest constituents (i.e. $N_\text{max} = 40$). Solid lines indicate the performance of the NP-optimal classifier obtained directly from the true log-likelihood ratio. Dashed lines correspond to classifiers trained with OmniLearn. The QCD rejection at a top tagging efficiency of 50% (R50) is shown in parentheses for each curve. Shaded bands indicate the statistical uncertainty on the optimal ROC curves, estimated from binomial counting errors on the background sample and propagated to the QCD rejection axis.
  • Figure 3: ROC curves for top vs. QCD jet classification derived from the EPiC-FM surrogate reference samples and the GPT models trained on them. Note that, although physical jets can contain more than 40 constituents, in this study each jet is represented only by its 40 hardest constituents (i.e. $N_\text{max} = 40$). Solid lines indicate the performance of the NP-optimal classifier obtained directly from the true log-likelihood ratio. Dashed lines correspond to classifiers trained with OmniLearn. The QCD rejection at a top tagging efficiency of 50% (R50) is shown in parentheses for each curve. Shaded bands indicate the statistical uncertainty on the optimal ROC curves, estimated from binomial counting errors on the background sample and propagated to the QCD rejection axis.
  • Figure 4: ROC curves from the 10-dimensional Gaussian toy model. Although the shifted signal and background distributions differ only slightly from their original counterparts, their mutual ROC curve appears highly inflated. Here, "Original" denotes the original signal and background, while "Shifted" denotes the slightly displaced versions. The background rejection at a signal efficiency of 50% (R50) is shown in parentheses for each curve. This demonstrates how small mismodelings of well-separated classes can artificially exaggerate separation power.
  • Figure 5: Comparison of top vs. QCD discrimination for both EPiC-FM surrogate reference jets and JetClass jets. The EPiC-FM curves, shown in solid red (continuous) and dashed red (bin-smeared, labeled "bs"), represent the true NP-optimal performance. Shaded bands indicate the statistical uncertainty on the optimal ROC curves, estimated from binomial counting errors on the background sample and propagated to the QCD rejection axis. For JetClass, where the true likelihood is not available, the solid and dashed blue curves show the corresponding OmniLearn classifiers trained on continuous and bin-smeared ("bs") JetClass samples, respectively. Across both EPiC-FM and JetClass jets, bin-smearing leads to slightly reduced separability between top and QCD jets.
  • ...and 5 more figures
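
The effect described in the Figure 4 caption, and the R50 uncertainty described in the Figure 2 caption, can be reproduced in a minimal sketch. Only the dimensionality (10) comes from the captions. The unit covariance, the class separation `SEP`, the shift `EPS`, the sample size and all function names are assumptions made for illustration, not the paper's settings. The score is the exact log-likelihood ratio, so each R50 is NP-optimal for its pair of distributions.

```python
import math
import numpy as np

DIM = 10                              # dimensionality quoted in the Figure 4 caption
UNIT = np.ones(DIM) / math.sqrt(DIM)  # assumed direction along which the means differ

def sample(rng, mean, n):
    """Draw n points from a unit-covariance Gaussian with mean `mean * UNIT`."""
    return rng.standard_normal((n, DIM)) + mean * UNIT

def llr(x, a, b):
    """Exact log[p_a(x) / p_b(x)] for unit-covariance Gaussians at a*UNIT and b*UNIT."""
    dist_b = np.sum((x - b * UNIT) ** 2, axis=1)
    dist_a = np.sum((x - a * UNIT) ** 2, axis=1)
    return 0.5 * (dist_b - dist_a)

def r50(score_sig, score_bkg):
    """Background rejection at 50% signal efficiency, with its binomial counting error."""
    cut = np.median(score_sig)  # threshold that keeps half of the signal
    n_bkg = score_bkg.size
    n_pass = int(np.count_nonzero(score_bkg > cut))
    if n_pass == 0:
        return math.inf, math.inf
    fpr = n_pass / n_bkg
    fpr_err = math.sqrt(fpr * (1.0 - fpr) / n_bkg)
    # Propagate the error on the false-positive rate to the rejection 1/fpr.
    return 1.0 / fpr, fpr_err / fpr**2

def r50_exact(separation):
    """Closed-form R50 for two unit-covariance Gaussians `separation` apart."""
    return 1.0 / (0.5 * math.erfc(abs(separation) / math.sqrt(2.0)))

def compare(rng, a, b, n=200_000):
    """Sample both classes and evaluate the NP-optimal R50 of a (signal) vs b (background)."""
    sig, bkg = sample(rng, a, n), sample(rng, b, n)
    return r50(llr(sig, a, b), llr(bkg, a, b))

SEP, EPS = 3.0, 0.25  # hypothetical separation and mismodeling shift
cases = {
    "original sig vs original bkg": (SEP, 0.0),
    "shifted sig vs shifted bkg": (SEP + EPS, -EPS),
    "shifted sig vs original sig": (SEP + EPS, SEP),
    "shifted bkg vs original bkg": (-EPS, 0.0),
}
rng = np.random.default_rng(0)
for label, (a, b) in cases.items():
    rej, err = compare(rng, a, b)
    print(f"{label}: R50 = {rej:.1f} +/- {err:.1f} (exact {r50_exact(a - b):.1f})")
```

In this setup the two-sample comparisons stay close to the R50 = 2 expected for indistinguishable samples, while the shifted signal-vs-background rejection is several times larger than the original one. The reason is that R50 depends on the tail of the background distribution, which changes steeply with the separation once the classes are already well separated.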