Neural Scaling Laws for Boosted Jet Tagging

Matthias Vigl; Nicole Hartman; Michael Kagan; Lukas Heinrich

TL;DR

This work derives compute-optimal scaling laws for boosted jet classification and identifies an effective performance limit that increased compute reliably approaches. More expressive, lower-level input features raise this performance limit and improve results at fixed dataset size.

Abstract

The success of Large Language Models (LLMs) has established that scaling compute, through joint increases in model capacity and dataset size, is the primary driver of performance in modern machine learning. While machine learning has long been an integral component of High Energy Physics (HEP) data analysis workflows, the compute used to train state-of-the-art HEP models remains orders of magnitude below that of industry foundation models. With scaling laws only beginning to be studied in the field, we investigate neural scaling laws for boosted jet classification using the public JetClass dataset. We derive compute-optimal scaling laws and identify an effective performance limit that can be consistently approached through increased compute. We study how data repetition, common in HEP where simulation is expensive, modifies the scaling, yielding a quantifiable effective dataset size gain. We then study how the scaling coefficients and asymptotic performance limits vary with the choice of input features and particle multiplicity, demonstrating that increased compute reliably drives performance toward an asymptotic limit, and that more expressive, lower-level features can raise the performance limit and improve results at fixed dataset size.

Paper Structure (9 sections, 5 equations, 6 figures, 2 tables)

Introduction
Related work
Dataset and training setup
Scaling Laws
Compute-Optimal Scaling
Scaling under data repetition
Input features dependence
Physics Performance
Conclusion

Figures (6)

Figure 1: Each point represents the validation loss of a Transformer encoder with $N$ parameters (varying the embedding dimension) trained on $D$ unique samples for exactly one epoch (no data repetition). Models are trained across a grid of $(N, D)$ configurations spanning several orders of magnitude in both axes. The parametric form in Eq. (\ref{eq:scaling_law}) is then fit jointly to all points. (a) Loss surface $L(N,D)$ as a function of model size $N$ and training dataset size $D$, with iso-loss contours shown as colored lines and the compute-optimal trajectory as the blue dashed line. The iso-FLOP lines are colored by the corresponding compute budget, with the color scale shown in (b). (b) Loss as a function of model size at fixed compute budget (iso-FLOP curves), with the compute-optimal trajectory crossing the minima of each curve.
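
As a rough illustration of how a compute-optimal trajectory like the one in Figure 1 can be read off a fitted loss surface, the sketch below assumes a Chinchilla-style form $L(N,D) = L_\infty + A/N^{\alpha} + B/D^{\beta}$ and the common approximation $C \approx 6ND$. Both are assumptions made here for illustration, and the coefficients are placeholders rather than fitted values from the paper.

```python
import numpy as np

# Placeholder coefficients (hypothetical, NOT the paper's fitted values).
L_INF, A, ALPHA, B, BETA = 0.8, 5.0, 0.4, 20.0, 0.3
FLOPS_PER_PARAM_SAMPLE = 6.0  # assumed rule of thumb: C ~ 6 * N * D


def loss(n, d):
    """Assumed parametric loss surface L(N, D)."""
    return L_INF + A / n**ALPHA + B / d**BETA


def compute_optimal(c):
    """Closed-form (N, D) minimising loss(N, D) subject to 6 * N * D = c."""
    k = c / FLOPS_PER_PARAM_SAMPLE
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = g * k ** (BETA / (ALPHA + BETA))
    return n_opt, k / n_opt


def iso_flop_minimum(c, num=20001):
    """Numerical cross-check: scan model sizes along one iso-FLOP curve."""
    k = c / FLOPS_PER_PARAM_SAMPLE
    n_grid = np.logspace(1, 9, num)
    losses = loss(n_grid, k / n_grid)
    i = int(np.argmin(losses))
    return n_grid[i], losses[i]
```

Evaluating `compute_optimal` over a range of budgets traces the trajectory through the iso-FLOP minima, and `iso_flop_minimum` gives a grid-based check of each point.
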
Figure 2: Scaling under data repetition. (a) Validation loss as a function of total training compute for models trained on fixed dataset sizes ranging from $D = 10^3$ to $D = 10^6$, compared to the compute-optimal scaling $L \propto C^{-0.15}$ (dashed line). Each point represents the early-stopped validation loss of a single model, obtained by stopping training at the epoch with the lowest validation loss. Models are trained across a two-dimensional grid: the dataset size $D$ is indicated by the line style, and the embedding dimension $d\in \{4, 8, 16, 32, 64, 96, 128, 256, 512 \}$ is also varied, with the resulting model size $N$ indicated by the color scale. Models trained with data repetition consistently lie above the compute-optimal frontier, with the validation loss eventually either saturating or rising due to overfitting. (b) Training models with sizes above the overfitting threshold makes it possible to minimize the validation loss at each fixed dataset size. Scaling along this trajectory yields the same compute exponent, at the price of roughly a factor of 10 in compute to reach the same loss as compute-optimal scaling with no data repetition. (c) Early-stopped validation loss as a function of dataset size $D$ for models trained above the overfitting threshold with data repetition (black points), compared to scaling with no data repetition in the large-$N$ regime (negligible model size term), $L(N\to\infty,D) = L_\infty + B / D^{\beta}$. Both curves are compute sub-optimal, as they correspond to the limit of large model size rather than the compute-optimal allocation. Although training on repeated data effectively amplifies the dataset, the gain is bounded by $\omega D_\text{rep}$.
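
One way to make the effective dataset size gain in Figure 2(c) concrete is to invert the no-repetition curve $L(N\to\infty,D) = L_\infty + B/D^{\beta}$ and ask which single-epoch dataset size would give the loss reached with repetition. The sketch below does this under stated assumptions: the function names are hypothetical, the coefficients used in the test values are placeholders, and the last helper reads $L \propto C^{-0.15}$ as applying to the reducible part of the loss, which is an interpretation and not something the caption states.

```python
def single_epoch_loss(d, l_inf, b, beta):
    """No-repetition loss in the large-N limit: L = l_inf + b / d**beta."""
    return l_inf + b / d**beta


def effective_dataset_size(loss_value, l_inf, b, beta):
    """Invert the curve above: the D that would give loss_value in one epoch."""
    if loss_value <= l_inf:
        raise ValueError("loss must exceed the asymptote l_inf")
    return (b / (loss_value - l_inf)) ** (1.0 / beta)


def repetition_gain(loss_with_repetition, d_rep, l_inf, b, beta):
    """Ratio of effective to actual dataset size for a repeated-data run."""
    return effective_dataset_size(loss_with_repetition, l_inf, b, beta) / d_rep


def reducible_loss_ratio(compute_factor, exponent=0.15):
    """Reducible-loss ratio at equal compute for a curve shifted by
    compute_factor, assuming reducible loss scales as C**(-exponent)."""
    return compute_factor**exponent
```

Under that reading, a factor of 10 in compute corresponds to `reducible_loss_ratio(10.0)`, i.e. $10^{0.15}$, in reducible loss at equal compute.
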
Figure 3: Overfitting threshold in the $(N, D)$ plane. Blue dots indicate models in the underfitting regime, where increasing training time leads to a plateau in the validation loss. Red crosses indicate models that overfit, where the validation loss eventually increases with further training. The green line shows the fitted threshold $N \propto D^{0.47}$, corresponding to a roughly square-root scaling between the minimum model size needed to overfit and the dataset size. 68% confidence intervals are obtained by bootstrapping threshold points.
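
A power-law threshold such as $N \propto D^{0.47}$ in Figure 3 is commonly obtained from a straight-line fit in log-log space, with a bootstrap over the threshold points giving the 68% interval. The sketch below shows that procedure on synthetic points; the data, prefactor, noise level, and function names are invented for illustration, and the paper's exact fitting procedure may differ.

```python
import numpy as np


def fit_power_law(d, n):
    """Fit n = prefactor * d**exponent by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(d), np.log(n), 1)
    return slope, np.exp(intercept)


def bootstrap_exponent(d, n, n_boot=1000, seed=0):
    """Central 68% interval of the exponent from resampling points."""
    rng = np.random.default_rng(seed)
    d, n = np.asarray(d, float), np.asarray(n, float)
    slopes = []
    while len(slopes) < n_boot:
        idx = rng.integers(0, d.size, size=d.size)
        if np.unique(d[idx]).size < 2:
            continue  # a line cannot be fit to a single distinct D value
        slopes.append(fit_power_law(d[idx], n[idx])[0])
    return np.percentile(slopes, [16, 84])


# Synthetic threshold points (hypothetical): N = 50 * D**0.47 with scatter.
rng = np.random.default_rng(1)
d_thr = np.logspace(3, 6, 10)
n_thr = 50.0 * d_thr**0.47 * np.exp(rng.normal(0.0, 0.1, d_thr.size))

exponent, prefactor = fit_power_law(d_thr, n_thr)
lo, hi = bootstrap_exponent(d_thr, n_thr)
```
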
Figure 4: Loss as a function of training dataset size $D$ for models trained above the overfitting threshold, using four configurations: kinematic variables only ($\Delta\eta,\Delta\phi,\log p_T$), and the full feature set with 10, 40, and 128 particles per jet. Dashed lines show fits to $L(D_{\mathrm{rep}}) = L_\infty + B_{\mathrm{rep}}/D_{\mathrm{rep}}^{\beta_{\mathrm{rep}}}$. The scaling exponent $\beta_{\mathrm{rep}}$ is roughly constant across configurations, while the asymptotic loss $L_\infty$ decreases with richer inputs and higher particle multiplicity; these configurations also reach the same performance with significantly less data.
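
The fit form in Figure 4, $L(D_{\mathrm{rep}}) = L_\infty + B_{\mathrm{rep}}/D_{\mathrm{rep}}^{\beta_{\mathrm{rep}}}$, is linear in $L_\infty$ and $B_{\mathrm{rep}}$ once $\beta_{\mathrm{rep}}$ is fixed. A minimal NumPy-only sketch exploits this by scanning the exponent on a grid and solving a linear least-squares problem at each value. This is one possible fitting strategy chosen here for simplicity, not necessarily the one used in the paper, and the function name and grid range are assumptions.

```python
import numpy as np


def fit_saturating_power_law(d, loss, betas=np.linspace(0.05, 1.0, 951)):
    """Fit loss = l_inf + b / d**beta.

    For each candidate beta the model is linear in (l_inf, b), so those two
    are solved by least squares; the beta with the smallest residual wins.
    """
    d = np.asarray(d, float)
    loss = np.asarray(loss, float)
    best = None
    for beta in betas:
        design = np.column_stack([np.ones_like(d), d ** (-beta)])
        coef, *_ = np.linalg.lstsq(design, loss, rcond=None)
        sse = float(np.sum((design @ coef - loss) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(coef[0]), float(coef[1]), float(beta))
    _, l_inf, b, beta = best
    return l_inf, b, beta
```

Running the fit separately for each input configuration would give one $(L_\infty, B_{\mathrm{rep}}, \beta_{\mathrm{rep}})$ triple per curve, which is what the comparison in the caption is based on.
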
Figure 5: QCD background rejection for each signal class at 50% signal efficiency, except for $H\to l\nu qq'$ (99%) and $t\to bl\nu$ (99.5%), as a function of total loss and dataset size when training for multiple epochs above the overfitting threshold. The performance of the ParT architecture trained on 100M samples is shown for reference (dashed lines); as expected, the curves cross this reference at a dataset size of about 100M jets. Each colored curve corresponds to one of the nine signal processes in the JetClass dataset.
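
Background rejection at a fixed signal efficiency, the metric in Figure 5, is usually computed by choosing the score threshold that keeps the desired fraction of signal jets and then taking the inverse of the background fraction passing that threshold. The sketch below assumes one scalar discriminant score per jet (higher meaning more signal-like); how the score is built from the classifier outputs is not specified here, and the function name is hypothetical.

```python
import numpy as np


def background_rejection(sig_scores, bkg_scores, signal_efficiency=0.5):
    """Return 1 / (background efficiency) at the requested signal efficiency."""
    sig = np.asarray(sig_scores, float)
    bkg = np.asarray(bkg_scores, float)
    # Threshold such that a fraction `signal_efficiency` of signal lies above it.
    threshold = np.quantile(sig, 1.0 - signal_efficiency)
    n_pass = np.count_nonzero(bkg > threshold)
    if n_pass == 0:
        return float("inf")  # no background jet survives the cut
    return bkg.size / n_pass
```

For the two leptonic classes in the caption, the same function would be called with `signal_efficiency=0.99` and `signal_efficiency=0.995`.
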
