Self-supervised Pretraining for Partial Differential Equations

Varun Madhavan, Amal S Sebastian, Bharath Ramsundar, Venkatasubramanian Viswanathan

TL;DR

This work describes a novel approach to building a neural PDE solver that leverages recent advances in transformer-based neural network architectures. The model generalizes over the space of PDE parameters, although its prediction error for individual parameter values is higher than that of the Fourier Neural Operator (FNO).

Abstract

In this work, we describe a novel approach to building a neural PDE solver that leverages recent advances in transformer-based neural network architectures. Our model can provide solutions for different values of PDE parameters without any need to retrain the network. Training is carried out in a self-supervised manner, similar to the pretraining approaches applied in language and vision tasks. We hypothesize that the model is in effect learning a family of operators (one per parameter value) mapping the initial condition to the solution of the PDE at any future time step $t$. We compare this approach with the Fourier Neural Operator (FNO) and demonstrate that it can generalize over the space of PDE parameters, despite having a higher prediction error for individual parameter values than the FNO. We show that performance on a specific parameter can be improved by finetuning the model with very small amounts of data. We also demonstrate that the model scales with both data and model size.

Paper Structure (14 sections, 10 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: PDE Transformer: The fluid transformer works by treating the context window of $N$ time steps as an image with channels containing information about the PDE parameters and time. This image is passed through a convolution layer that breaks it into patches of dimension $[\text{px}, \text{pt}]$ and projects each patch into a token of size $h$. After passing the tokens through $L$ transformer encoder layers (shown on the right), we reconstruct the image through a transposed convolution, from which we extract the state at time $N+1$.
  • Figure 2: In-domain and out-of-domain scaling with more data: The plot shows how the prediction MSE varies with the training dataset size (i.e. the number of trajectories per parameter value $\omega$). In the in-domain setting, the prediction error decreases steadily as the dataset grows for all systems. In the out-of-domain setting, the prediction error for Burgers also decreases steadily as the dataset grows. For advection, the prediction error decreases only slowly from 200 trajectories/$\omega$ onwards; the models could be under-fitting, leading to this trend. For the 2D-NSE, however, no discernible trend is observed. This again indicates that the models are possibly under-fitting the training sets, so adding more data does not improve performance.
  • Figure 3: Finetuning a pretrained model: Finetuning a pretrained model for the Burgers equation on the out-of-domain parameters improves performance for most parameter values. Note that for most parameter values, the pretrained model substantially outperforms random initialization.
  • Figure 4: In-domain and out-of-domain scaling with model size: The plot shows how the prediction MSE varies with the model size (i.e. the number of learnable parameters). In the in-domain setting, the prediction error decreases monotonically with increasing model size for all systems. In the out-of-domain setting, the prediction error for Burgers and 2D-NSE decreases steadily with increasing size, whereas the error for advection increases. This anomaly might be due to the larger models overfitting the advection training data.
  • Figure 5: FNO Scaling of Advection and Burgers equations: On scaling the number of trajectories in the dataset for the FNO, we see that the performance remains nearly the same for both the Advection and Burgers equations, suggesting that the FNO learns quickly from little data.
  • ...and 1 more figures
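
The tokenization described in the Figure 1 caption can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the helper names (`build_image`, `patchify`, `unpatchify`), the three-channel layout (state, parameter, time), the toy sizes, and the random projection matrices are all assumptions made for the example. It relies on the fact that a convolution whose kernel size equals its stride is equivalent to cutting the image into non-overlapping patches and applying one shared linear projection to each flattened patch (bias omitted here). The transformer encoder layers are omitted, and the sketch stops at the reconstructed image because the caption does not say how the state at time $N+1$ is read out of it.

```python
import numpy as np


def build_image(states, omega, times):
    """Stack state, PDE parameter, and time as channels of a [3, N, X] image.

    Assumed layout: the caption only says the channels carry parameter and
    time information, not how they are arranged.
    """
    n_steps, n_points = states.shape
    param_channel = np.full((n_steps, n_points), omega, dtype=states.dtype)
    time_channel = np.broadcast_to(times[:, None], (n_steps, n_points))
    return np.stack([states, param_channel, time_channel.astype(states.dtype)])


def patchify(image, pt, px):
    """Split a [C, N, X] image into flat, non-overlapping [pt, px] patches.

    Returns an array of shape [num_patches, C * pt * px].
    """
    c, n, x = image.shape
    assert n % pt == 0 and x % px == 0, "patch size must divide the image"
    # [C, N/pt, pt, X/px, px] -> [N/pt, X/px, C, pt, px]
    grid = image.reshape(c, n // pt, pt, x // px, px).transpose(1, 3, 0, 2, 4)
    return grid.reshape((n // pt) * (x // px), c * pt * px)


def unpatchify(patches, c, n, x, pt, px):
    """Inverse of patchify: rebuild the [C, N, X] image from flat patches."""
    grid = patches.reshape(n // pt, x // px, c, pt, px).transpose(2, 0, 3, 1, 4)
    return grid.reshape(c, n, x)


# Toy sizes (hypothetical): N = 8 time steps, X = 16 grid points,
# patches of 2 time steps by 4 grid points, token size h = 32.
rng = np.random.default_rng(0)
N, X, pt, px, h = 8, 16, 2, 4, 32

states = rng.standard_normal((N, X))
image = build_image(states, omega=0.5, times=np.arange(N, dtype=float))

patches = patchify(image, pt, px)                       # [16, 24]
W_embed = rng.standard_normal((patches.shape[1], h)) * 0.02
tokens = patches @ W_embed                              # [16, 32]

# The L transformer encoder layers would act on `tokens` here (omitted).

# Transposed-convolution analogue: project each token back to a flat patch
# and reassemble. Using the same channel count for the output is an assumption.
W_out = rng.standard_normal((h, patches.shape[1])) * 0.02
reconstructed = unpatchify(tokens @ W_out, 3, N, X, pt, px)

print(image.shape, patches.shape, tokens.shape, reconstructed.shape)
```

With these toy sizes the image splits into (8/2) × (16/4) = 16 patches of 3 × 2 × 4 = 24 values each, so the transformer would see a sequence of 16 tokens. In a deep learning framework the two matrix products would normally be a strided convolution and a transposed convolution with learned weights.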