MATEY: multiscale adaptive foundation models for spatiotemporal physical systems

Pei Zhang, M. Paul Laiu, Matthew Norman, Doug Stefanski, John Gounley

TL;DR

MATEY tackles the bottleneck of representing multiscale spatiotemporal physics with vision transformers by introducing adaptive tokenization and axial/spatiotemporal attention schemes. It combines a multi-physics preprocessor/postprocessor with two adaptive tokenization modes (Adap_Mul, Adap_Mix) and three attention variants (AViT, SViT, ViT) to reduce compute while preserving accuracy, enabling efficient training on long sequences. Pretraining on PDEBench followed by finetuning on out-of-distribution tasks (colliding thermals and MHD) shows pretrained models can outperform random initializations, particularly in low-data regimes, though the gains depend on downstream physics similarity. The work highlights that SViT offers a practical balance of efficiency and accuracy, while adaptive tokenization robustly handles high-resolution multiscale data, suggesting a viable path toward practical, physics-informed foundation models.

Abstract

Accurate representation of the multiscale features in spatiotemporal physical systems using vision transformer (ViT) architectures requires extremely long, computationally prohibitive token sequences. To address this issue, we propose two adaptive tokenization schemes that dynamically adjust patch sizes based on local features: one ensures convergent behavior to uniform patch refinement, while the other offers better computational efficiency. Moreover, we present a set of spatiotemporal attention schemes, where the temporal or axial spatial dimensions are decoupled, and evaluate their computational and data efficiencies. We assess the performance of the proposed multiscale adaptive model, MATEY, in a sequence of experiments. The results show that adaptive tokenization schemes achieve improved accuracy without significantly increasing the length of the token sequence. Compared to a full spatiotemporal attention scheme or a scheme that decouples only the temporal dimension, we find that fully decoupled axial attention is less efficient and expressive, requiring more training time and model weights to achieve the same accuracy. Finally, we demonstrate in two fine-tuning tasks featuring different physics that models pretrained on PDEBench data outperform the ones trained from scratch, especially in the low data regime with frozen attention.
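To make the cost trade-off between the attention schemes concrete, the sketch below counts query-key pairs per attention layer for a token grid of T time steps and H by W spatial patches. The three labels, and their mapping to ViT (full), SViT (time decoupled), and AViT (fully axial), are an assumption based on the wording of the abstract; the function name and the counting formulas are illustrative and not taken from the paper.

```python
def attention_pair_counts(T, H, W):
    """Count query-key pairs for three attention layouts on a T x H x W token grid.

    Hypothetical helper for illustration; the formulas are the usual
    sequence-length-squared counts, not the paper's own cost model.
    """
    n = T * H * W
    # Full spatiotemporal attention: every token attends to every token.
    full = n * n
    # Time decoupled: spatial attention within each time step (T sequences of H*W),
    # then temporal attention at each spatial location (H*W sequences of T).
    time_decoupled = T * (H * W) ** 2 + (H * W) * T ** 2
    # Fully axial: each token attends along the T, H, and W axes separately.
    axial = n * (T + H + W)
    return {"full": full, "time_decoupled": time_decoupled, "axial": axial}


if __name__ == "__main__":
    for name, pairs in attention_pair_counts(T=4, H=8, W=8).items():
        print(f"{name}: {pairs}")
```

A lower pair count does not by itself imply better accuracy per unit of training time: the abstract reports that the fully decoupled axial scheme needed more training time and more model weights to reach the same accuracy.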

Paper Structure (30 sections, 14 equations, 16 figures, 4 tables)

Figures (16)

  • Figure 1: MATEY: multiscale adaptive foundation models for spatiotemporal physical systems.
  • Figure 2: Adaptive tokenization that dynamically adjusts patch sizes based on local features. There are three essential parameters: $[p_{x_1}, p_{y_1}]$, $[p_{x_{\textup{sts}}}, p_{y_{\textup{sts}}}]$ and $\gamma_{\textup{sts}}$. The parameter $[p_{x_1}, p_{y_1}]$ denotes the initial coarse patch size, $[p_{x_{\textup{sts}}}, p_{y_{\textup{sts}}}]$ represents the refined patch size, and $\gamma_\textup{sts}\in[0,1]$ determines which patches to refine. We select patches with local variances greater than $\gamma_\textup{sts}$ times the maximum variance across all patches (see Equation (\ref{eq-ind})).
  • Figure 3: Learning efficiency of AViT, SViT, and ViT at three model sizes, measured by final predictive error and training time. In this experiment, SViT and ViT are more expressive and computationally efficient than AViT: they require fewer model parameters and less training time to achieve the same test accuracy.
  • Figure 4: Predicted temperature contours at $t=590$ from Ti-SViT models with constant patch sizes ps=$16\times16$ and ps=$32\times32$ and with adaptive tokenization (Adap_Mul with $p_{x_1}=p_{y_1}=32$, $p_{x_{\textup{sts}}}=p_{y_{\textup{sts}}}=16$, and $\gamma_\textup{sts}=0.2$). Adap_Mul predicts smoother, finer local structures that ps=$32\times32$ misses, comparable to those from the more expensive ps=$16\times16$.
  • Figure 5: Final NRMSE loss for Tiny ViT (left) and SViT (right) with adaptive tokenization (Adap_Mix with hyperparameters ($p_{x_1}$, $p_{x_{\textup{sts}}}$, $\gamma_\textup{sts}$)) and constant patch sizes against average sequence length, $L_\textup{avg,mix}$ (Equation (\ref{eq-lmix})). Error bars, representing standard deviations from 3 runs, are shown for ViT. As $\gamma_\textup{sts}$ varies from 1.0 to 0.0, Adap_Mix shows a clear convergent transition from the coarse constant patch size ps=$p_{x_1}\times p_{y_1}$ to the fine constant patch size ps=$p_{x_{\textup{sts}}}\times p_{y_{\textup{sts}}}$. More interestingly, Adap_Mix achieves lower prediction errors than the more expensive ps=$p_{x_{\textup{sts}}}\times p_{y_{\textup{sts}}}$ cases despite requiring only half of the average sequence length.
  • ...and 11 more figures
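
The selection rule in the Figure 2 caption (refine the patches whose local variance exceeds $\gamma_\textup{sts}$ times the maximum patch variance) can be sketched in NumPy as follows. This is a minimal illustration for a single 2D field, not the authors' implementation: the function names are hypothetical, and the token count assumes each selected coarse patch is simply replaced by its refined sub-patches, which may differ from how Adap_Mul and Adap_Mix actually assemble their sequences.

```python
import numpy as np


def select_patches_to_refine(field, coarse_patch, gamma):
    """Boolean mask over coarse patches: True where variance > gamma * max variance.

    Hypothetical helper; `field` is a 2D array and `coarse_patch` is (p_y, p_x).
    """
    h, w = field.shape
    py, px = coarse_patch
    if h % py or w % px:
        raise ValueError("field shape must be divisible by the coarse patch size")
    ny, nx = h // py, w // px
    # Split into (ny, py, nx, px), then gather the pixels of each patch together.
    patches = field.reshape(ny, py, nx, px).transpose(0, 2, 1, 3)
    local_var = patches.reshape(ny, nx, -1).var(axis=-1)
    # Strict inequality, following the caption's "greater than".
    return local_var > gamma * local_var.max()


def token_count(mask, coarse_patch, fine_patch):
    """Tokens if every selected coarse patch is replaced by its fine sub-patches.

    Assumption for illustration only, not the paper's sequence-length formula.
    """
    sub = (coarse_patch[0] // fine_patch[0]) * (coarse_patch[1] // fine_patch[1])
    n_refined = int(mask.sum())
    return (mask.size - n_refined) + n_refined * sub


if __name__ == "__main__":
    # A 64x64 field that varies only in its top-left 32x32 quadrant.
    field = np.zeros((64, 64))
    field[:32, :32] = np.arange(1024.0).reshape(32, 32)
    mask = select_patches_to_refine(field, coarse_patch=(32, 32), gamma=0.2)
    print(mask)
    print(token_count(mask, coarse_patch=(32, 32), fine_patch=(16, 16)))
```

With the strict inequality, $\gamma_\textup{sts}=1.0$ selects no patches, so the tokenization falls back to the coarse patch size; lowering $\gamma_\textup{sts}$ selects progressively more patches for refinement.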