A Multimodal PDE Foundation Model for Prediction and Scientific Text Descriptions

Elisa Negrini, Yuxuan Liu, Liu Yang, Stanley J. Osher, Hayden Schaeffer

TL;DR

The paper addresses PDE foundation modeling across multiple equations with a transformer-based multimodal framework that jointly processes numerical inputs and textual descriptions. A custom multimodal tokenizer encodes text with GPT-2 and numerical data with an MLP. A cross-attention decoder then outputs the numerical solution operator, and an autoregressive text decoder generates scientific descriptions. The approach achieves strong numerical accuracy (average relative error below $3.3\%$ in distribution and below $7.8\%$ out of distribution) and high-quality text descriptions (BERTScore/F1 $>0.93$), and it demonstrates time extrapolation on several equation classes. These results indicate the framework's potential for interpretable, multimodal PDE foundation modeling that generalizes beyond the training distribution, offering a path toward integrated numerical and textual scientific reasoning.

Abstract

Neural networks are one tool for approximating non-linear differential equations used in scientific computing tasks such as surrogate modeling, real-time predictions, and optimal control. PDE foundation models utilize neural networks to train approximations to multiple differential equations simultaneously and are thus general-purpose solvers that can be adapted to downstream tasks. Current PDE foundation models focus on learning general solution operators and/or the governing system of equations, and thus only handle numerical or symbolic modalities. However, real-world applications may require more flexible data modalities, e.g. text analysis or descriptive outputs. To address this gap, we propose a novel multimodal deep learning approach that leverages a transformer-based architecture to approximate solution operators for a wide variety of ODEs and PDEs. Our method integrates numerical inputs, such as equation parameters and initial conditions, with text descriptions of physical processes or system dynamics. This enables our model to handle settings where symbolic representations may be incomplete or unavailable. In addition to providing accurate numerical predictions, our approach generates interpretable scientific text descriptions, offering deeper insights into the underlying dynamics and solution properties. The numerical experiments show that our model provides accurate solutions for in-distribution data (with average relative error less than 3.3%) and out-of-distribution data (average relative error less than 7.8%) together with precise text descriptions (with correct descriptions generated 100% of the time). In certain tests, the model is also shown to be capable of extrapolating solutions in time.
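
The accuracy figures above are stated as average relative errors. The abstract does not say which norm is used, so the following is only a minimal sketch that assumes a relative $L^2$ error computed per sample and then averaged over samples. The function names `relative_l2_error` and `average_relative_error` are hypothetical and do not come from the paper.

```python
import math


def relative_l2_error(pred, truth, eps=1e-12):
    """Relative L2 error ||pred - truth||_2 / ||truth||_2 for flat sequences.

    Assumption: the paper's "relative error" is an L2-type ratio; the
    abstract does not specify the norm.
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)))
    den = math.sqrt(sum(t ** 2 for t in truth))
    # eps guards against division by zero for an all-zero reference solution
    return num / max(den, eps)


def average_relative_error(pairs):
    """Mean of per-sample relative errors over (pred, truth) pairs."""
    errors = [relative_l2_error(p, t) for p, t in pairs]
    return sum(errors) / len(errors)
```

Under this assumed definition, a reported value of 3.3% would correspond to `average_relative_error(...)` returning roughly `0.033` on the test set.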

Paper Structure

This paper contains 21 sections, 39 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Model Illustration. Our model processes multimodal input, where textual prompts describe equations, and numerical values represent initial conditions and parameters. A custom tokenizer encodes text using a GPT-2 tokenizer and numerical inputs via an MLP. The tokenized sequence is processed by an LLM backbone, followed by two decoding pathways: a transformer decoder for text generation and a data decoder with cross-attention to construct the operator.
  • Figure 2: Comparison of outputs. For PDE examples, we show from left to right the ground truth solution, the predicted solution, and the absolute difference. First row, from left to right: 1D ODE (index 4), 2D ODE (index 12, Duffing system), 3D ODE (index 7, Neural Dynamics). Second row: PDE (index 18, Korteweg-de Vries equation). Third row: Conservation Law (index 25, Burgers' equation). Fourth row: Conservation Law with rarefaction (index 47, Inviscid Conservation Law with Cubic Flux). Fifth row: Conservation Law with shock (index 43, Inviscid Conservation Law with Cosine Flux).
  • Figure 3: Example of extrapolation in time. The model is trained on data generated for $t\in[0,5]$ and tasked with predicting the solution for $t\in[5,10]$. From left to right: ground truth solution, predicted solution, absolute difference. First row: Heat Equation (index 13). Second row: Advection Equation (index 19). Third row: Diffusion-Reaction Square Logistic (index 24). Fourth row: Inviscid Conservation Law with Cubic Flux (index 29). Fifth row: Conservation Law with Sine Flux (index 30).
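
The data pathway described in the Figure 1 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the embedding table stands in for the GPT-2 tokenizer, a single self-attention layer stands in for the LLM backbone, the text-generation decoder is omitted, and all sizes, weights, and names (`D`, `mlp`, `attention`, `query_points`) are hypothetical. It only shows how text embeddings and MLP-embedded numerical inputs can form one sequence, and how a cross-attention decoder can evaluate the output at arbitrary query points.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical embedding width


def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with tanh activation."""
    return np.tanh(x @ w1 + b1) @ w2 + b2


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


# Text modality: token ids -> embeddings (stand-in for a GPT-2 tokenizer).
vocab_size = 50
embed_table = rng.normal(size=(vocab_size, D))
token_ids = np.array([3, 17, 42, 8])
text_emb = embed_table[token_ids]                       # (4, D)

# Numerical modality: each row is a hypothetical (t, x, u) sample, embedded by an MLP.
numeric = rng.normal(size=(10, 3))
params_in = (rng.normal(size=(3, 32)), np.zeros(32),
             rng.normal(size=(32, D)), np.zeros(D))
num_emb = mlp(numeric, *params_in)                      # (10, D)

# Multimodal sequence: text tokens followed by numerical tokens.
sequence = np.concatenate([text_emb, num_emb], axis=0)  # (14, D)

# Backbone stand-in: one residual self-attention layer.
hidden = sequence + attention(sequence, sequence, sequence)

# Data decoder: queries come from (t, x) points, keys/values from the backbone.
t_grid, x_grid = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 8),
                             indexing="ij")
query_points = np.stack([t_grid, x_grid], axis=-1).reshape(-1, 2)  # (40, 2)
params_q = (rng.normal(size=(2, 32)), np.zeros(32),
            rng.normal(size=(32, D)), np.zeros(D))
queries = mlp(query_points, *params_q)                  # (40, D)
context = attention(queries, hidden, hidden)            # (40, D)
w_out = rng.normal(size=(D, 1))
u_pred = (context @ w_out).reshape(5, 8)                # value at each (t, x)
```

Because the queries are built from coordinates rather than from a fixed output grid, the same decoder can in principle be asked for times outside the training window, which is the setting Figure 3 examines. With random weights the values in `u_pred` are meaningless; only the shapes and data flow are illustrative.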