Towards Foundation Models for Scientific Machine Learning: Characterizing Scaling and Transfer Behavior

Shashank Subramanian; Peter Harrington; Kurt Keutzer; Wahid Bhimji; Dmitriy Morozov; Michael Mahoney; Amir Gholami

TL;DR

The paper investigates building foundation models for scientific machine learning by pre-training a Fourier Neural Operator on a diverse set of PDE problems (Poisson, Advection-Diffusion, Helmholtz) and transferring to downstream tasks via zero-shot and few-shot fine-tuning. It conducts a systematic scaling analysis over model size ($64K$ to $256M$ parameters), downstream data size, and shifts in physics parameters, including a mixed-operator pre-training regime. The results show substantial data efficiency gains from pre-training, with larger models yielding greater transfer benefits, and demonstrate that a single mixed-operator pre-trained model can adapt to multiple PDE systems, including some out-of-distribution (OOD) scenarios. These gains persist, though they weaken, as downstream data become plentiful. The work provides a concrete path toward SciML foundation models and releases code to enable reproducibility and further research in multi-task, data-efficient PDE learning.

Abstract

Pre-trained machine learning (ML) models have shown great performance for a wide range of applications, in particular in natural language processing (NLP) and computer vision (CV). Here, we study how pre-training could be used for scientific machine learning (SciML) applications, specifically in the context of transfer learning. We study the transfer behavior of these models as (i) the pre-trained model size is scaled, (ii) the downstream training dataset size is scaled, (iii) the physics parameters are systematically pushed out of distribution, and (iv) a single model pre-trained on a mixture of different physics problems is adapted to various downstream applications. We find that, when fine-tuned appropriately, transfer learning can help reach desired accuracy levels with orders of magnitude fewer downstream examples (across different tasks that can even be out-of-distribution) than training from scratch, with consistent behavior across a wide range of downstream examples. We also find that fine-tuning these models yields more performance gains as model size increases, compared to training from scratch on new downstream tasks. These results hold for a broad range of PDE learning tasks. All in all, our results demonstrate the potential of the "pre-train and fine-tune" paradigm for SciML problems, pointing to a path towards building SciML foundation models. We open-source our code for reproducibility.
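
One way to make the "orders of magnitude fewer downstream examples" claim concrete is to ask, for a target test error, how many downstream examples each training strategy needs, and then take the ratio. The sketch below is illustrative only: the function names and the two error curves are hypothetical placeholders and are not results from the paper.

```python
from typing import Optional, Sequence


def examples_needed(sizes: Sequence[int], errors: Sequence[float],
                    target: float) -> Optional[int]:
    """Smallest dataset size whose test error is at or below `target`."""
    for n, err in sorted(zip(sizes, errors)):
        if err <= target:
            return n
    return None  # the target error is never reached on this curve


def data_efficiency_gain(sizes: Sequence[int],
                         scratch_errors: Sequence[float],
                         finetune_errors: Sequence[float],
                         target: float) -> Optional[float]:
    """Examples needed from scratch divided by examples needed with fine-tuning.

    Assumes all dataset sizes are positive (a zero-shot point with 0 examples
    would need separate handling).
    """
    n_scratch = examples_needed(sizes, scratch_errors, target)
    n_finetune = examples_needed(sizes, finetune_errors, target)
    if n_scratch is None or n_finetune is None:
        return None
    return n_scratch / n_finetune


# Hypothetical error curves (NOT taken from the paper), for illustration only.
sizes = [2 ** k for k in (5, 7, 9, 11, 13, 15)]
scratch = [0.90, 0.60, 0.30, 0.12, 0.05, 0.02]
finetuned = [0.06, 0.04, 0.03, 0.025, 0.02, 0.018]

gain = data_efficiency_gain(sizes, scratch, finetuned, target=0.05)
print(gain)
```

With these made-up curves, the from-scratch model needs 8192 examples and the fine-tuned model needs 128 to reach the target, so the printed ratio is 64.0. Applying the same calculation to measured curves would give the kind of data efficiency factor the abstract refers to.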

Paper Structure (15 sections, 4 equations, 11 figures, 1 table)

Introduction
Related work
Methods
Results
Conclusions
Appendix: Additional Details
Pre-train and downstream data creation
Model architecture
Training details and code open-source
Hyperparameter tuning
Input normalization
Additional Results
TL behavior over underlying physics
TL behavior underlying multiple operators
Sensitivity to random seeds

Figures (11)

Figure 1: Our setup consists of creating diverse training datasets, sampling both PDE coefficients and source functions simultaneously with different PDE operators and input data (coefficients, sources) distributions for pre-training. A neural operator is then pre-trained to predict the PDE solutions given these inputs and the ground truth solutions (computed through PDE solvers). The pre-trained model is then adapted with minimal fine-tuning (zero-shot or few-shot), and it is used in various downstream tasks (PDE systems) that can be in-domain or out-of-domain from the pre-training datasets. The pre-training with multiple solution operators allows the same model to transfer to several very different systems. For instance, PDE 2 (Helmholtz) manifests highly oscillatory solutions compared to, say, PDE 1 (Advection-Diffusion) or PDE 3 (Poisson's). We further characterize the scaling and transfer properties of this model as a function of downstream data scale and model size scale.
Figure 2: Visualization of the source function sampling (left) and the effect of certain PDE coefficients (right) on the solutions for the different systems. On the left, as we go down, the sparsity of Gaussians increases, leading to sparser and more spread-out source functions, which encourages heterogeneity in the dataset. For one of these source functions, we apply the different PDE operators with varying ranges of certain PDE coefficients to illustrate their effect on the solutions. On the top row, for SYS-1 (Poisson's), we show that by increasing the diffusion tensor eigenvalue $e$ (but keeping the direction $\theta$ fixed), we increase anisotropy and diffusion as we move towards the right. In the middle, we increase the velocity scales for SYS-2 (Advection-Diffusion), but keep the diffusion tensor and velocity direction the same, to demonstrate the increasing competition between advection and diffusion processes as we go right. Finally, at the bottom, we show the highly oscillatory behavior in SYS-3 (Helmholtz) as we increase the wavenumber $\omega$. Note the significant differences between the solutions of the different systems.
Figure 3: Addressing (Q1). Testing error as a function of downstream examples for SYS-1 and SYS-2. We visualize the distribution of pre-training and downstream dataset physics at the top to illustrate (and quantify) the extent of distributional shifts. We observe excellent zero-shot and few-shot TL performance of the pre-trained model despite the modest OOD shifts, and about a $100\times$ increase in data efficiency in medium-data regimes. We observe diminishing returns from pre-training in the large-data regime ($O(2^{15})$ examples), which has as many examples as used in pre-training.
Figure 4: Addressing (Q2). Model size scaling for SYS-1 and SYS-2 from $64K$ to $256M$ parameters for medium OOD test-cases. While finetuning consistently improves the model performance and data efficiency, we observe higher errors for small parameter regimes at $64K$ due to insufficient model capacity. The performance gains from finetuning are significantly boosted with larger model sizes, monotonically up to $256M$ parameters.
Figure 5: Addressing (Q3). Testing error as a function of downstream examples for different downstream tasks used in SYS-1. We show the extent of overlap (signifying distributional shifts) between the pre-trained and downstream datasets at the top using the range of sampled diffusion tensor eigenvalues. For datasets within distribution, zero-shot TL is optimal. As the downstream dataset shifts moderately OOD, zero-shot performance degrades gradually and is recovered through fine-tuning. This recovery is slower as the distributional shifts increase.
...and 6 more figures
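
The Figure 2 caption describes two ingredients of the input data: source functions built from Gaussians, and a diffusion tensor controlled by an eigenvalue $e$ and a direction $\theta$. Below is a minimal sketch of both. It assumes the tensor is a rotation of diag(1, e) by angle $\theta$ and that the source is a plain sum of isotropic Gaussians with random centers on the unit square. These choices, and the names `diffusion_tensor` and `gaussian_source`, are assumptions for illustration; the paper's appendix on data creation is the authoritative description.

```python
import math
import random


def diffusion_tensor(e: float, theta: float):
    """2x2 symmetric tensor R(theta) diag(1, e) R(theta)^T (assumed form)."""
    c, s = math.cos(theta), math.sin(theta)
    kxx = c * c + e * s * s
    kyy = s * s + e * c * c
    kxy = (1.0 - e) * c * s
    return [[kxx, kxy], [kxy, kyy]]


def gaussian_source(n_gaussians: int, grid_size: int,
                    width: float = 0.05, seed: int = 0):
    """Sum of isotropic Gaussians with random centers on the unit square.

    Returns a grid_size x grid_size nested list; fewer Gaussians give a
    sparser source function.
    """
    rng = random.Random(seed)
    centers = [(rng.random(), rng.random()) for _ in range(n_gaussians)]
    field = []
    for i in range(grid_size):
        y = i / (grid_size - 1)
        row = []
        for j in range(grid_size):
            x = j / (grid_size - 1)
            value = sum(
                math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * width ** 2))
                for cx, cy in centers
            )
            row.append(value)
        field.append(row)
    return field


# Example: a strongly anisotropic tensor and a sparse source on a 32x32 grid.
K = diffusion_tensor(e=5.0, theta=math.pi / 6)
f = gaussian_source(n_gaussians=3, grid_size=32)
```

Under this parametrization the tensor has eigenvalues 1 and $e$, so increasing $e$ at fixed $\theta$ strengthens diffusion along one axis and hence the anisotropy, which matches the trend the caption describes for SYS-1.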
