BigBang-Proton Technical Report: Next-Word-Prediction is Scientific Multitask Learner

Hengkui Wu, Liujiang Liu, Jihua He, Qihao Wang, Keke Zhao, Shuyang Hu, Renle Fu, Dahao Liang, Lingyu Zeng, Bruce Liu, Yuan Liu, Jin Zhan, Jiaqiang Niu, Xinglong Jia, Yaqin Hu, Wenjun Ji, Panpan Chi, Ken Chen, Hengyuan Wu, Yingsi Xin, Yongfeng Zhu, Yuexin Wang, Manqi Ruan, Ningtao Bian, Xiaohua Wu, Weipeng Xu

TL;DR

BigBang-Proton presents a generalist, multitask language model designed to perform language-guided scientific computing across scales, structures, and disciplines. It unifies textual theory and numerical experiment through Theory-Experiment Learning, replaces traditional tokenization with Binary Patch Encoding, and enables ultra-long context with Monte Carlo Attention, all within a 1.5B-parameter, 20-layer architecture. The model demonstrates strong performance across arithmetic, particle-jet tagging, inter-atomic potential regression, lake-water spatiotemporal prediction, and genome sequence tasks, often matching or exceeding specialized models while preserving multitask capabilities. The results argue that structure-aware, cross-disciplinary pretraining can unlock robust, generalizable scientific reasoning, and the authors propose universe-scale pretraining as a provocative direction toward foundational material-world AI with broad implications for science and engineering.

Abstract

We introduce BigBang-Proton, a unified sequence-based architecture for auto-regressive language modeling, pretrained on cross-scale, cross-structure, cross-discipline real-world scientific tasks to construct a scientific multi-task learner. BigBang-Proton incorporates three fundamental innovations compared to mainstream general-purpose LLMs: the Theory-Experiment Learning paradigm aligns large-scale numerical experimental data with theoretical text corpora; Binary Patch Encoding replaces byte pair encoding (BPE) tokenization; and Monte Carlo Attention replaces the traditional transformer architecture. Through next-word-prediction pretraining on cross-discipline scientific datasets of real-world problems mixed with a general text corpus, followed by fine-tuning and inference on downstream tasks, BigBang-Proton demonstrates 100% accuracy on arithmetic addition of up to 50 digits, performance on par with leading specialized models in particle physics jet tagging, an MAE matching that of specialized models in inter-atomic potential simulation, performance comparable to traditional spatiotemporal models in water quality prediction, and benchmark-exceeding performance in genome modeling. These results show that language-guided scientific computing can match or exceed the performance of task-specific scientific models while maintaining multitask learning capabilities. We further propose scaling the pretraining to the universe scale as a fundamental step toward developing a material-world foundational model.

Paper Structure

This paper contains 50 sections, 30 equations, 28 figures, 10 tables.

Figures (28)

  • Figure 2: Binary Patch Encoding: how a particle jet is converted into a series of byte tokens. This multi-modality native method converts all dataset formats originally stored in bits to byte sequences using UTF-8. Patching is introduced to reduce computational complexity. Binary Patch Encoding eliminates the tokenization process and vocabulary requirements.
  • Figure 3: Overview of the Monte Carlo Attention architecture. The model consists of three main components: (1) input embedding, which converts discrete input tokens into dense vector representations; (2) Monte Carlo Attention, which uses an inter-patch delegation mechanism to drive local and global information exchange, so that context length grows exponentially with the number of layers while computational complexity stays linear; and (3) a feed-forward temporal convolutional network (TCN), which replaces the fully connected feed-forward network of a transformer and captures local spatial and temporal patterns. Because the TCN learns positional information, the positional embeddings used in transformers are eliminated.
  • Figure 4: Embedding vectors are reorganized between patches. Each patch sends delegates to and receives delegates from other patches for information exchange through attention computations.
  • Figure 5: Layer-wise inter-patch delegation operations drive the context length of information flow to grow on the order of $P^{N+1}$, where $P$ is the patch size and $N$ is the number of layers. For a patch size of 32, information can reach 992 in layer one and 32736 in layer two.
  • Figure 6: Training loss and perplexity during pre-training on a heterogeneous corpus of 9 diverse datasets
  • ...and 23 more figures
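
The captions for Figures 2 and 5 describe two mechanisms that a few lines of Python can illustrate: turning raw data into fixed-size byte patches, and the growth of reachable context with depth. The sketch below is not the authors' implementation. The padding value, the default patch size used for encoding, the function names, and the sample jet-like string are all assumptions made for illustration. The closed form $P^{N+1} - P$ is likewise an inference: it is simply the expression that reproduces both values quoted in the Figure 5 caption (992 and 32736 for $P = 32$).

```python
from typing import List

PAD_BYTE = 0  # assumed padding value; the source does not specify one


def to_byte_patches(text: str, patch_size: int = 32) -> List[List[int]]:
    """Encode text as UTF-8 bytes and split the bytes into fixed-size patches.

    No vocabulary is involved: every value is a raw byte in the range 0..255.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    data = list(text.encode("utf-8"))
    # Pad the tail so that every patch holds exactly patch_size bytes.
    remainder = len(data) % patch_size
    if remainder:
        data.extend([PAD_BYTE] * (patch_size - remainder))
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]


def reach(patch_size: int, num_layers: int) -> int:
    """Return P**(N+1) - P, the form that matches the Figure 5 caption values."""
    return patch_size ** (num_layers + 1) - patch_size


if __name__ == "__main__":
    # Hypothetical jet-like record; the field names are illustrative only.
    patches = to_byte_patches("pt=45.3 eta=-1.2 phi=0.7", patch_size=8)
    print(len(patches), patches[0])
    for n in (1, 2):
        print(f"layers={n} reach={reach(32, n)}")
```

With a patch size of 32, `reach(32, 1)` gives 992 and `reach(32, 2)` gives 32736, in agreement with the caption. Whether the same closed form holds for deeper layers is not stated in this excerpt.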