RINO: Renormalization Group Invariance with No Labels

Zichun Hao, Raghav Kansal, Abhijith Gandrakota, Chang Sun, Jennifer Ngadiuba, Javier Duarte, Maria Spiropulu

TL;DR

RINO tackles the domain shift between Monte Carlo (MC) simulations and real collider data by pretraining a transformer-based jet representation model on unlabeled, data-like jets to learn energy-scale-invariant features. It uses a DINO-like self-distillation framework in which the views of each jet are derived from different kt-clustering depths, approximating renormalization-group flow in QCD so that the learned embedding is invariant to the scale at which the jet is resolved. With JetClass standing in for real data and JetNet standing in for simulation, RINO is finetuned on JetNet for top tagging and yields 15–26% accuracy gains over a supervised baseline when evaluated on JetClass, while maintaining competitive in-domain performance on JetNet. This suggests a practical pathway to more robust HEP ML models with reduced reliance on MC simulations; open-source code is provided for replication.
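
As an illustration of what "views derived from kt-clustering depths" could look like in practice, the sketch below merges a jet's constituents pairwise by smallest kt distance until a target number of subjets remains, and returns one view per target depth. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `kt_cluster_to_n` and `make_views`, the choice of depths, and the pt-weighted recombination (used here instead of four-vector addition) are all hypothetical, and constituents are assumed to have strictly positive pt.

```python
import numpy as np

def kt_cluster_to_n(pt, eta, phi, n_target):
    """Merge constituents by smallest kt distance until n_target remain.

    Hypothetical helper: uses d_ij = min(pt_i, pt_j)^2 * DeltaR_ij^2 and a
    simplified pt-weighted recombination, not full four-vector addition.
    """
    pt = np.asarray(pt, dtype=float).copy()
    eta = np.asarray(eta, dtype=float).copy()
    phi = np.asarray(phi, dtype=float).copy()
    while len(pt) > max(n_target, 1):
        # Pairwise kt distances, with phi differences wrapped to [-pi, pi).
        deta = eta[:, None] - eta[None, :]
        dphi = np.mod(phi[:, None] - phi[None, :] + np.pi, 2 * np.pi) - np.pi
        dij = np.minimum(pt[:, None], pt[None, :]) ** 2 * (deta**2 + dphi**2)
        np.fill_diagonal(dij, np.inf)
        i, j = np.unravel_index(np.argmin(dij), dij.shape)
        # Merge the closest pair (assumption: pt-weighted axis, scalar pt sum).
        pt_sum = pt[i] + pt[j]
        eta_new = (pt[i] * eta[i] + pt[j] * eta[j]) / pt_sum
        dphi_ij = np.mod(phi[j] - phi[i] + np.pi, 2 * np.pi) - np.pi
        phi_new = np.mod(phi[i] + pt[j] / pt_sum * dphi_ij + np.pi, 2 * np.pi) - np.pi
        keep = np.ones(len(pt), dtype=bool)
        keep[[i, j]] = False
        pt = np.append(pt[keep], pt_sum)
        eta = np.append(eta[keep], eta_new)
        phi = np.append(phi[keep], phi_new)
    return pt, eta, phi

def make_views(pt, eta, phi, depths=(2, 4, 8)):
    """Return one coarse-grained view of the jet per clustering depth."""
    return [kt_cluster_to_n(pt, eta, phi, n) for n in depths]

# Toy jet with four constituents (made-up numbers, for illustration only).
views = make_views([10.0, 5.0, 1.0, 0.5],
                   [0.0, 0.1, 0.5, -0.4],
                   [0.0, 0.1, -0.3, 0.2],
                   depths=(1, 2, 4))
```

Each view describes the same jet at a different resolution, so a model trained to give all views the same embedding is pushed toward features that do not depend on the resolution scale.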

Abstract

A common challenge with supervised machine learning (ML) in high energy physics (HEP) is the reliance on simulations for labeled data, which can often mismodel the underlying collision or detector response. To help mitigate this problem of domain shift, we propose RINO (Renormalization Group Invariance with No Labels), a self-supervised learning approach that can instead pretrain models directly on collision data, learning embeddings invariant to renormalization group flow scales. In this work, we pretrain a transformer-based model on jets originating from quantum chromodynamic (QCD) interactions from the JetClass dataset, emulating real QCD-dominated experimental data, and then finetune on the JetNet dataset, emulating simulations, for the task of identifying jets originating from top quark decays. RINO generalizes better from the JetNet training data to JetClass data than supervised training on JetNet from scratch, demonstrating the potential for RINO pretraining on real collision data, followed by finetuning on small, high-quality MC datasets, to improve the robustness of ML models in HEP.
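
The TL;DR describes the self-supervised objective as DINO-like self-distillation. For readers unfamiliar with that family of losses, here is a minimal NumPy sketch of the generic DINO objective: a student network is trained to match a centered, temperature-sharpened teacher distribution computed on a different view of the same input, and the teacher's weights are an exponential moving average of the student's. The function names and the temperature and momentum values are assumptions in the style of the original DINO recipe; the settings RINO actually uses are not given in this summary.

```python
import numpy as np

def log_softmax(x, temp):
    """Numerically stable log-softmax of x / temp along the last axis."""
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dino_loss(student_logits, teacher_logits, center,
              t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the sharpened teacher and student distributions.

    Temperatures are assumed, DINO-style values. The teacher output is
    centered and treated as a fixed target (no gradient would flow through it).
    """
    p_teacher = np.exp(log_softmax(teacher_logits - center, t_teacher))
    log_p_student = log_softmax(student_logits, t_student)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def update_center(center, teacher_logits, momentum=0.9):
    """Running mean of teacher outputs; centering helps prevent collapse."""
    return momentum * center + (1.0 - momentum) * teacher_logits.mean(axis=0)

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow the student as an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Toy check with made-up logits: a student that agrees with the teacher
# incurs a lower loss than one that disagrees.
teacher = np.array([[5.0, 0.0, 0.0]])
center = np.zeros(3)
loss_agree = dino_loss(np.array([[5.0, 0.0, 0.0]]), teacher, center)
loss_disagree = dino_loss(np.array([[0.0, 5.0, 0.0]]), teacher, center)
new_center = update_center(center, teacher)
new_teacher = ema_update([np.ones(2)], [np.zeros(2)], momentum=0.5)
```

In a RINO-like setup, the student and teacher inputs would be views of the same jet at different clustering depths, so minimizing this loss encourages embeddings that agree across scales.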

Paper Structure

This paper contains 25 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: RINO training strategy (left): pretrain on JetClass QCD jets as a proxy for real data, then finetune on JetNet top tagging. Supervised baseline strategy (right): train from scratch on JetNet. Both strategies are then evaluated on top tagging on JetClass.
  • Figure 2: Visualization of learned jet representations from the base model using PCA (left) and t-SNE embedding (right). Jet representations of all five models are shown in the experiments appendix.
  • Figure 3: Model architecture of the transformer encoder backbone. The jet's representation is taken as the transformer embedding corresponding to the jet token.
  • Figure 4: Visualization of learned jet representations from the nano (row 1), lite (row 2), mini (row 3), and base (row 4) models using PCA (left) and t-SNE embedding (right).
  • Figure 5: Confusion matrix of BDT for hadronic top class (Tbqq) vs QCD class from JetClass on the learned jet embeddings from pretrained nano (top left), lite (top right), mini (bottom left), and base (bottom right) models.