Towards Foundation Models for Scientific Machine Learning: Characterizing Scaling and Transfer Behavior
Shashank Subramanian, Peter Harrington, Kurt Keutzer, Wahid Bhimji, Dmitriy Morozov, Michael Mahoney, Amir Gholami
TL;DR
The paper investigates building foundation models for scientific machine learning by pre‑training a Fourier Neural Operator on a diverse set of PDE problems (Poisson, Advection–Diffusion, Helmholtz) and transferring to downstream tasks via zero‑shot and few‑shot fine‑tuning. It conducts a systematic scaling analysis over model size ($64K$ to $256M$ parameters), downstream data size, and shifts in physics parameters, including a mixed‑operator pre‑training regime. The results show substantial data efficiency gains from pre‑training, with larger models yielding greater transfer benefits, and demonstrate that a single mixed‑operator pre‑trained model can adapt to multiple PDE systems, including some OOD scenarios; these gains persist though they weaken as downstream data become plentiful. The work provides a concrete path toward SciML foundation models and releases code to enable reproducibility and further research in multi‑task, data‑efficient PDE learning.
Abstract
Pre-trained machine learning (ML) models have shown great performance for a wide range of applications, in particular in natural language processing (NLP) and computer vision (CV). Here, we study how pre-training could be used for scientific machine learning (SciML) applications, specifically in the context of transfer learning. We study the transfer behavior of these models as (i) the pre-trained model size is scaled, (ii) the downstream training dataset size is scaled, (iii) the physics parameters are systematically pushed out of distribution, and (iv) how a single model pre-trained on a mixture of different physics problems can be adapted to various downstream applications. We find that, when fine-tuned appropriately, transfer learning can help reach desired accuracy levels with orders of magnitude fewer downstream examples (across different tasks that can even be out-of-distribution) than training from scratch, with consistent behavior across a wide range of downstream examples. We also find that fine-tuning these models yields more performance gains as model size increases, compared to training from scratch on new downstream tasks. These results hold for a broad range of PDE learning tasks. All in all, our results demonstrate the potential of the "pre-train and fine-tune" paradigm for SciML problems, demonstrating a path towards building SciML foundation models. We open-source our code for reproducibility.
