Scaling Particle Collision Data Analysis
Hengkui Wu, Panpan Chi, Yongfeng Zhu, Liujiang Liu, Shuyang Hu, Yuexin Wang, Chen Zhou, Qihao Wang, Yingsi Xin, Bruce Liu, Dahao Liang, Xinglong Jia, Manqi Ruan
TL;DR
This work tackles the challenge of applying large language models to numerically intensive scientific data by proposing BBT-Neutron, a task-agnostic transformer that uses binary tokenization to unify textual and large-scale numerical inputs. The model is benchmarked on Jet Origin Identification (JoI) in high-energy physics and achieves performance on par with specialized domain models such as ParticleNet and Particle Transformer, while exhibiting emergent scaling behavior as data volume grows. The findings suggest that a generalist architecture with binary tokenization of numerical data can scale to complex scientific tasks and may serve as a foundational tool for data analysis across Big Science projects, with potential extensions to other domains requiring precise numerical computation. The work also emphasizes an open-source release and future expansion to additional tasks and modalities, advocating a shift toward task-agnostic, transferable scientific intelligence.
Abstract
For decades, researchers have developed task-specific models to address scientific challenges across diverse disciplines. Recently, large language models (LLMs) have shown strong capabilities in handling general tasks; however, these models encounter difficulties in addressing real-world scientific problems, particularly in domains involving large-scale numerical data analysis, such as experimental high-energy physics. This limitation is primarily due to the inefficacy of byte pair encoding (BPE) tokenization on numerical data. In this paper, we propose a task-agnostic architecture, BBT-Neutron, which employs a binary tokenization method to facilitate pretraining on a mixture of textual and large-scale numerical experimental data. We demonstrate the application of BBT-Neutron to Jet Origin Identification (JoI), a critical classification task in high-energy physics that distinguishes jets originating from various quarks or gluons. Our results indicate that BBT-Neutron achieves performance comparable to that of state-of-the-art task-specific JoI models. Furthermore, we examine how BBT-Neutron's performance scales with increasing data volume; the results suggest that BBT-Neutron could serve as a foundational model for particle physics data analysis, with possible extensions to a broad spectrum of scientific computing applications for Big Science experiments, industrial manufacturing, and spatial computing. The project code is available at https://github.com/supersymmetry-technologies/bbt-neutron.
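As a rough illustration of why byte-level input can suit numerical data, the sketch below maps a string to its UTF-8 byte values, so that every digit is always the same single token regardless of the number it appears in. This is a generic byte-level scheme written for illustration only; it assumes that "binary tokenization" operates on raw bytes, and it is not taken from the BBT-Neutron code. The function names and the sample string are hypothetical.

```python
def byte_tokenize(text: str) -> list[int]:
    """Map a string to its UTF-8 byte values; each token ID lies in [0, 255]."""
    return list(text.encode("utf-8"))


def byte_detokenize(ids: list[int]) -> str:
    """Invert byte_tokenize by decoding the byte values back to a string."""
    return bytes(ids).decode("utf-8")


# Hypothetical record mixing text and numbers (not from the paper's dataset).
sample = "pt=45.3127 eta=-1.02"
ids = byte_tokenize(sample)

# The mapping is lossless, and for ASCII input it yields one token per character,
# so the digit "4" is always token 52 wherever it occurs.
restored = byte_detokenize(ids)
print(ids[:8], restored == sample)
```

With a learned subword vocabulary such as BPE, the same number may instead be split into chunks that depend on corpus statistics, so nearby values can receive unrelated token sequences. A fixed byte vocabulary avoids that dependence at the cost of longer sequences.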
