Scaling Particle Collision Data Analysis

Hengkui Wu, Panpan Chi, Yongfeng Zhu, Liujiang Liu, Shuyang Hu, Yuexin Wang, Chen Zhou, Qihao Wang, Yingsi Xin, Bruce Liu, Dahao Liang, Xinglong Jia, Manqi Ruan

TL;DR

This work tackles the challenge of applying large language models to numerically intensive scientific data by proposing BBT-Neutron, a task-agnostic transformer that uses Binary Tokenization to unify textual and large-scale numerical inputs. The model is benchmarked on Jet Origin Identification (JoI) in high-energy physics and achieves performance on par with specialized domain models such as ParticleNet and Particle Transformer, while exhibiting emergent scaling behavior as data volume grows. The findings suggest that a generalist architecture with binary numerics can scale to complex scientific tasks and may serve as a foundational tool for data analysis across Big Science projects, with potential extensions to other domains requiring precise numerical computation. The work also emphasizes an open-source release and future expansion to additional tasks and modalities, advocating a shift toward task-agnostic, transferable scientific intelligence.

Abstract

For decades, researchers have developed task-specific models to address scientific challenges across diverse disciplines. Recently, large language models (LLMs) have shown enormous capabilities in handling general tasks; however, these models encounter difficulties in addressing real-world scientific problems, particularly in domains involving large-scale numerical data analysis, such as experimental high-energy physics. This limitation is primarily due to the inefficacy of byte-pair encoding (BPE) tokenization with numerical data. In this paper, we propose a task-agnostic architecture, BBT-Neutron, which employs a binary tokenization method to facilitate pretraining on a mixture of textual and large-scale numerical experimental data. We demonstrate the application of BBT-Neutron to Jet Origin Identification (JoI), a critical categorization challenge in high-energy physics that distinguishes jets originating from various quarks or gluons. Our results indicate that BBT-Neutron achieves performance comparable to state-of-the-art task-specific JoI models. Furthermore, we examine the scaling behavior of BBT-Neutron's performance with increasing data volume, suggesting the potential for BBT-Neutron to serve as a foundational model for particle physics data analysis, with possible extensions to a broad spectrum of scientific computing applications for Big Science experiments, industrial manufacturing, and spatial computing. The project code is available at https://github.com/supersymmetry-technologies/bbt-neutron.
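To make the contrast with BPE concrete, the following is a minimal sketch of byte-level tokenization applied to a numerical record. It is an illustration only: the abstract does not specify the binary tokenization scheme, so plain UTF-8 byte encoding is assumed here, and the function names and the sample record are hypothetical. The point it shows is that every digit maps to its own token from a fixed 256-entry vocabulary, so numbers are never merged into learned multi-digit chunks.

```python
def byte_tokenize(text: str) -> list[int]:
    """Map a string to its UTF-8 byte values (assumed scheme, ids in 0..255)."""
    return list(text.encode("utf-8"))


def byte_detokenize(ids: list[int]) -> str:
    """Invert byte_tokenize for a complete, valid byte sequence."""
    return bytes(ids).decode("utf-8")


# Hypothetical record mixing text labels and numerical values.
record = "pt=45.127 eta=-1.203"
ids = byte_tokenize(record)

# The mapping is lossless: decoding recovers the exact digits.
assert byte_detokenize(ids) == record

# One token per character for ASCII input, so sequence length is predictable.
print(len(ids), ids[:3])
```

Under this assumption the token sequence is longer than a BPE encoding of the same string, which is the usual trade-off of byte-level schemes: a uniform, lossless treatment of digits in exchange for longer sequences.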

Paper Structure

This paper contains 11 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Event display of an $e^+e^-\rightarrow \nu\bar{\nu} H\rightarrow \nu\bar{\nu} gg$ ($\sqrt{s} = 240$ GeV) event simulated and reconstructed with the CEPC baseline detector [CEPC_CDR_Phy]. Different particles are depicted with colored curves and straight lines: red for $e^{\pm}$, cyan for $\mu^{\pm}$, blue for $\pi^{\pm}$, orange for photons, and magenta for neutral hadrons.
  • Figure 2: The confusion matrix $M_{11}$ obtained by BBT-Neutron, ParticleNet, and Particle Transformer for $\nu\bar{\nu}H, H\to jj$ events at 240 GeV center-of-mass energy. Each jet category contains one million jets, of which 60% are used for training, 20% for validation, and the remaining 20% for testing. Each matrix is normalized to unity for each truth label (row).
  • Figure 3: Jet flavor tagging efficiencies and charge flip rates for each quark species with ParticleNet and Particle Transformer.
  • Figure 4: Scaling behavior of BBT-Neutron, ParticleNet (PN), and Particle Transformer (ParT) in terms of jet flavor tagging efficiency as a function of training data volume. Panels (a), (b), (c), (d), and (e) correspond to the bottom, charm, strange, up, and down quark flavors, respectively.
  • Figure 5: Scaling behavior of BBT-Neutron, ParticleNet (PN), and Particle Transformer (ParT) in terms of jet charge flip rate as a function of training data volume. Panels (a), (b), (c), (d), and (e) correspond to the bottom, charm, strange, up, and down quark flavors, respectively.
  • ...and 2 more figures
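
The row normalization mentioned in the Figure 2 caption, and the per-flavor quantities plotted in Figures 3 to 5, can be sketched as follows. This is a hedged illustration rather than the authors' code. It assumes that the 11 categories are the five quark flavors with their antiquarks plus the gluon, in the hypothetical order given by `LABELS`; that tagging efficiency is the fraction of jets of a flavor assigned to that flavor regardless of charge; and that the charge flip rate is the fraction of those correctly flavored jets assigned the wrong charge. The counts are toy values, not results from the paper.

```python
import numpy as np

# Hypothetical label order: quark/antiquark pairs followed by the gluon.
LABELS = ["b", "bbar", "c", "cbar", "s", "sbar", "u", "ubar", "d", "dbar", "g"]


def row_normalize(counts):
    """Normalize each truth-label row of a confusion matrix to sum to one."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def flavor_metrics(m, flavor):
    """Return (tagging efficiency, charge flip rate) for one quark flavor.

    `m` is a row-normalized matrix; rows are truth labels, columns predictions.
    Both definitions are assumptions stated in the text above.
    """
    q = LABELS.index(flavor)
    qbar = q + 1
    same = m[q, q] + m[qbar, qbar]      # right flavor, right charge
    flipped = m[q, qbar] + m[qbar, q]   # right flavor, wrong charge
    efficiency = (same + flipped) / 2.0  # average over the q and qbar rows
    flip_rate = flipped / (same + flipped)
    return efficiency, flip_rate


# Toy counts: 80 correct, 10 charge-flipped, 10 called gluon, per quark row.
counts = np.zeros((11, 11))
for q in range(0, 10, 2):
    for i, j in ((q, q + 1), (q + 1, q)):
        counts[i, i] = 80
        counts[i, j] = 10
        counts[i, 10] = 10
counts[10, 10] = 100  # all gluon jets tagged correctly in this toy case

m = row_normalize(counts)
eff, flip = flavor_metrics(m, "b")
print(float(eff), float(flip))
```

Averaging over the quark and antiquark rows is only appropriate when both categories have equal statistics, which matches the equal per-category sample sizes described in the Figure 2 caption.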