MaD-Scientist: AI-based Scientist solving Convection-Diffusion-Reaction Equations Using Massive PINN-Based Prior Data
Mingu Kang, Dongseok Lee, Woojin Cho, Jaehyeon Park, Kookjin Lee, Anthony Gruber, Youngjoon Hong, Noseong Park
TL;DR
MaD-Scientist presents a scientific foundation model (SFM) that leverages in-context learning and Bayesian-style priors to predict PDE solutions from noisy, low-cost prior data. By constructing a PINN-based prior data space and training a Transformer to perform zero-shot inference, the approach demonstrates robust solution prediction for the 1D convection–diffusion–reaction equation across multiple reaction terms and data-noise scenarios. The key contributions are the demonstration that approximated priors can support effective pre-training of SFMs, the integration of PINN priors with in-context learning, and an observed superconvergence effect in which inaccurate priors still yield highly accurate predictions. This offers a scalable pathway to pre-train SFMs with realistic, low-cost data, enabling broad applicability in settings where governing equations are unknown or vary over time, with potential impact on rapid PDE solving and scientific discovery.
Abstract
Large language models (LLMs) such as ChatGPT have shown that, even when trained on noisy prior data, they can generalize effectively to new tasks through in-context learning (ICL) and pre-training techniques. Motivated by this, we explore whether a similar approach can be applied to scientific foundation models (SFMs). Our methodology is structured as follows: (i) we collect low-cost physics-informed neural network (PINN)-based approximated prior data in the form of solutions to partial differential equations (PDEs) constructed through an arbitrary linear combination of mathematical dictionaries; (ii) we utilize Transformer architectures with self- and cross-attention mechanisms to predict PDE solutions without knowledge of the governing equations in a zero-shot setting; (iii) we provide experimental evidence on the one-dimensional convection-diffusion-reaction equation, which demonstrates that pre-training remains robust even with approximated prior data, with only marginal impacts on test accuracy. Notably, this finding opens the path to pre-training SFMs with realistic, low-cost data instead of (or in conjunction with) high-cost numerical data. These results support the conjecture that SFMs can improve in a manner similar to LLMs, where fully cleaning the vast set of sentences crawled from the Internet is nearly impossible.
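
As a rough illustration of step (i), the sketch below shows one way a PDE could be assembled as a linear combination of dictionary terms with sampled coefficients. Everything here is an assumption made for illustration and is not taken from the abstract: the dictionary contents, the sign convention (u_t + β u_x − ν u_xx − ρ f(u) = 0 with a Fisher-type reaction f(u) = u(1 − u)), the coefficient range, and the names `DICTIONARY`, `sample_coefficients`, and `residual`. The sanity check uses the pure-convection case, for which u(x, t) = sin(x − βt) is an exact solution.

```python
import numpy as np

# Hypothetical dictionary of spatial terms (illustrative, not the authors' exact choice).
# Each entry maps (u, u_x, u_xx) to one term of the PDE.
DICTIONARY = {
    "convection": lambda u, u_x, u_xx: u_x,
    "diffusion": lambda u, u_x, u_xx: -u_xx,
    "reaction_fisher": lambda u, u_x, u_xx: -u * (1.0 - u),
}

def sample_coefficients(rng, low=0.0, high=1.0):
    """Draw one coefficient per dictionary term (the range is an assumption)."""
    return {name: float(rng.uniform(low, high)) for name in DICTIONARY}

def residual(coeffs, u_t, u, u_x, u_xx):
    """Evaluate u_t + sum_j c_j * term_j(u, u_x, u_xx) pointwise."""
    total = np.array(u_t, dtype=float)
    for name, c in coeffs.items():
        total = total + c * DICTIONARY[name](u, u_x, u_xx)
    return total

# Sanity check: with only convection active, u = sin(x - beta * t) has zero residual.
beta, t = 0.7, 0.3
x = np.linspace(0.0, 2.0 * np.pi, 64)
u = np.sin(x - beta * t)
u_x = np.cos(x - beta * t)
u_xx = -np.sin(x - beta * t)
u_t = -beta * np.cos(x - beta * t)

coeffs = {"convection": beta, "diffusion": 0.0, "reaction_fisher": 0.0}
res = residual(coeffs, u_t, u, u_x, u_xx)

# One randomly sampled equation from the (hypothetical) prior space.
random_coeffs = sample_coefficients(np.random.default_rng(0))
```

In a PINN-based setting, a residual of this kind would typically serve as the physics loss for each sampled coefficient set; the resulting approximate solutions would then play the role of the low-cost prior data described above.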
