A Multimodal PDE Foundation Model for Prediction and Scientific Text Descriptions
Elisa Negrini, Yuxuan Liu, Liu Yang, Stanley J. Osher, Hayden Schaeffer
TL;DR
The paper tackles PDE foundation modeling across multiple equations by introducing a transformer-based multimodal framework that jointly processes numerical inputs and textual descriptions. A custom multimodal tokenizer embeds text via a GPT-2 backbone and numerical data via an MLP; a cross-attention decoder then outputs the numerical solution operator, while an autoregressive text generator produces scientific descriptions. The approach achieves strong numerical accuracy (average relative error $<3.3\%$ in-distribution and $<7.8\%$ out-of-distribution) and high-quality text descriptions (BERTScore/F1 $>0.93$), and demonstrates time extrapolation on several equation classes. These results highlight the framework’s potential for interpretable, multimodal PDE foundation modeling with robust generalization, offering a path toward integrated numerical and textual scientific reasoning.
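To make the cross-attention step concrete, here is a minimal, single-head sketch in plain Python. It is not the authors' implementation: the token vectors, the names `text_tokens`, `data_tokens`, and `query_tokens`, and the omission of learned projection matrices are all assumptions made for illustration. It only shows how query tokens (for example, embedded query times) could attend over a fused sequence of text and numerical tokens.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: list of d-dimensional vectors (what we want to predict at)
    keys, values: one vector per context token (what we condition on)
    Returns one output vector per query.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Hypothetical embeddings (assumed values, already mapped to a shared dimension).
text_tokens = [[1.0, 0.0]]   # stand-in for a text-backbone embedding
data_tokens = [[0.0, 1.0]]   # stand-in for an MLP embedding of numerical input
context = text_tokens + data_tokens  # fused multimodal sequence

query_tokens = [[1.0, 0.0]]  # stand-in for an embedded query time
values = [[1.0], [3.0]]      # toy value vectors, one per context token

result = cross_attention(query_tokens, context, values)
print(result)
```

In this toy setup the query aligns with the first context token, so the output is a weighted average that lies closer to that token's value. A trained model would add learned query, key, and value projections and multiple heads.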
Abstract
Neural networks are one tool for approximating non-linear differential equations used in scientific computing tasks such as surrogate modeling, real-time predictions, and optimal control. PDE foundation models utilize neural networks to train approximations to multiple differential equations simultaneously and thus act as general-purpose solvers that can be adapted to downstream tasks. Current PDE foundation models focus on learning general solution operators and/or the governing system of equations, and thus handle only numerical or symbolic modalities. However, real-world applications may require more flexible data modalities, e.g. text analysis or descriptive outputs. To address this gap, we propose a novel multimodal deep learning approach that leverages a transformer-based architecture to approximate solution operators for a wide variety of ODEs and PDEs. Our method integrates numerical inputs, such as equation parameters and initial conditions, with text descriptions of physical processes or system dynamics. This enables our model to handle settings where symbolic representations may be incomplete or unavailable. In addition to providing accurate numerical predictions, our approach generates interpretable scientific text descriptions, offering deeper insights into the underlying dynamics and solution properties. The numerical experiments show that our model provides accurate solutions for in-distribution data (average relative error less than 3.3%) and out-of-distribution data (average relative error less than 7.8%), together with precise text descriptions (correct descriptions generated 100% of the time). In certain tests, the model is also shown to be capable of extrapolating solutions in time.
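As a rough illustration of how an "average relative error" figure like those above can be computed, the sketch below averages a per-sample relative error over a small test set. The choice of the $L^2$ norm, the function names, and the toy data are assumptions; the abstract does not state which norm the authors use.

```python
import math

def relative_l2_error(pred, true):
    """Relative L2 error ||pred - true||_2 / ||true||_2 for one sample."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    return num / den

def average_relative_error(preds, trues):
    """Mean of the per-sample relative errors over a test set."""
    errors = [relative_l2_error(p, t) for p, t in zip(preds, trues)]
    return sum(errors) / len(errors)

# Toy data (made up for illustration, not results from the paper).
trues = [[1.0, 2.0, 2.0], [3.0, 4.0, 0.0]]
preds = [[1.0, 2.0, 2.3], [3.0, 4.0, 0.5]]

avg = average_relative_error(preds, trues)
print(f"average relative error: {100 * avg:.1f}%")
```

Averaging per-sample ratios, rather than dividing total error by total norm, keeps samples with small solution magnitude from being drowned out by large ones.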
