On the Value of Tokeniser Pretraining in Physics Foundation Models

Hadi Sotoudeh; Payel Mukhopadhyay; Ruben Ohana; Michael McCabe; Neil D. Lawrence; Shirley Ho; Miles Cranmer

TL;DR

This work shows that pretraining the tokeniser with an autoencoding objective prior to training the dynamics model enhances computational efficiency for physics emulation, and introduces flexible spatiotemporal compression operations that extend causal convolutions to support runtime-adjustable compression ratios, enabling efficient adaptation to diverse downstream tasks.
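To make "runtime-adjustable compression ratio" concrete, the sketch below applies a causal convolution along the time axis and lets the caller pick the temporal stride at call time, so the same kernel weights can serve several compression ratios. This is an illustration under stated assumptions, not the operator introduced in the paper: the function name `causal_conv_time`, the zero left-padding, and the use of stride as the compression mechanism are all hypothetical choices made for this example.

```python
import numpy as np


def causal_conv_time(x, kernel, stride=1):
    """Causal convolution along axis 0 (time) with a call-time stride.

    Hypothetical illustration only. x has shape (T, ...), kernel has shape (K,).
    The output at time t depends only on x[t-K+1], ..., x[t].
    """
    x = np.asarray(x, dtype=np.float64)
    kernel = np.asarray(kernel, dtype=np.float64)
    k = kernel.shape[0]
    # Zero-pad only on the past side so no future frame leaks into an output.
    pad = [(k - 1, 0)] + [(0, 0)] * (x.ndim - 1)
    xp = np.pad(x, pad)
    # Keep every `stride`-th output frame: temporal compression ratio = stride.
    out_times = range(0, x.shape[0], stride)
    return np.stack(
        [np.tensordot(kernel, xp[t:t + k], axes=(0, 0)) for t in out_times]
    )


# Same weights, two compression ratios chosen at call time.
frames = np.arange(8.0)
weights = np.array([0.5, 0.5])
half = causal_conv_time(frames, weights, stride=2)     # 8 frames -> 4
quarter = causal_conv_time(frames, weights, stride=4)  # 8 frames -> 2
```

Because the padding sits entirely on the past side, perturbing a later frame leaves all earlier outputs unchanged, which is the property that makes the convolution causal.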

Abstract

We investigate the impact of tokeniser pretraining on the accuracy and efficiency of physics emulation. Modern high-resolution simulations produce vast volumes of data spanning diverse physical regimes and scales. Training foundation models to learn the dynamics underlying such data enables the modelling of complex multiphysics phenomena, especially in data-limited settings. The emerging class of physics foundation models typically aims to learn two tasks jointly: (i) extracting compact representations of high-resolution spatiotemporal data, and (ii) capturing governing physical dynamics. However, learning both tasks from scratch simultaneously can impede the effectiveness of either process. We demonstrate that pretraining the tokeniser with an autoencoding objective prior to training the dynamics model enhances computational efficiency for downstream tasks. Notably, the magnitude of this benefit depends on domain alignment: pretraining on the same physical system as the downstream task yields the largest improvements, while pretraining on other systems provides moderate gains. In-domain pretraining reduces VRMSE by 64% after 10,500 training steps compared to training from scratch. To our knowledge, this is the first systematic investigation of tokeniser pretraining for physics foundation models. We further introduce flexible spatiotemporal compression operations that extend causal convolutions to support runtime-adjustable compression ratios, enabling efficient adaptation to diverse downstream tasks. Our findings provide practical guidance for training efficient physics emulators and highlight the importance of strategic pretraining data selection.
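For readers unfamiliar with the headline metric, the sketch below shows one common way to compute a variance-scaled RMSE (VRMSE) and a relative reduction between two runs. The abstract does not spell out the formula, so the definition used here (mean squared error divided by the variance of the target, then square-rooted), the `eps` stabiliser, and the function names are assumptions for illustration rather than the paper's evaluation code.

```python
import numpy as np


def vrmse(pred, target, eps=1e-7):
    """Variance-scaled RMSE over all elements (assumed definition).

    A value near 1 means the prediction is no better than predicting
    the mean of the target; 0 means a perfect prediction.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    var = np.mean((target - target.mean()) ** 2)
    return float(np.sqrt(mse / (var + eps)))


def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


target = np.array([0.0, 1.0, 2.0, 3.0])
mean_guess = np.full_like(target, target.mean())
score_mean = vrmse(mean_guess, target)   # close to 1
score_exact = vrmse(target, target)      # exactly 0
```

Under this reading, a 64% reduction means the pretrained model's VRMSE is 0.36 times that of the from-scratch baseline at the same training step.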

Paper Structure (29 sections, 10 equations, 4 figures, 7 tables)

Introduction
Methods
Data
Experiments and Evaluation Criteria
Training Objective
Architecture
Parameter Counts and Training Recipe
Results
Training Cost
Downstream Performance
Freezing Strategies
Discussion
Key Findings
Limitations
Future Directions
...and 14 more sections

Figures (4)

Figure 1: Experimental setup.
Figure 2: Next-frame prediction learning curves on the validation set over 29,400 training steps for different pretraining and freezing configurations. Each panel shows a different evaluation metric: VRMSE (leftmost), and spectral error at low, mid, and high frequency ranges.
Figure 3: Autoregressive rollout learning curves over 210,000 training steps at different prediction horizons (single run). Each panel shows VRMSE for a different rollout window. The shaded region indicates the training range shown in Figure 2.
Figure 4: MAGVIT-Simple architecture.
