Pre-Generating Multi-Difficulty PDE Data for Few-Shot Neural PDE Solvers

Naman Choudhary, Vedant Singh, Ameet Talwalkar, Nicholas Matthew Boffi, Mikhail Khodak, Tanya Marwah

TL;DR

This work investigates how the composition of training data by difficulty affects the performance of neural PDE solvers. By systematically varying geometry and Reynolds-number difficulty in 2D incompressible Navier–Stokes simulations and pre-generating datasets at easy, medium, and hard levels, the authors show that incorporating a small amount of lower-difficulty data significantly boosts performance on hard distributions, enabling substantial compute savings. Across supervised operators (CNO, F-FNO) and large pretrained models (Poseidon variants), mixing about 10% hard examples with easy/medium data recovers most of the hard-data performance, with medium-difficulty data often offering the best cost-to-performance tradeoffs. They also demonstrate the potential of foundation-like datasets to amortize pre-generation costs across multiple downstream tasks, motivating data-centric strategies for scalable neural PDE solvers.

Abstract

A key aspect of learned partial differential equation (PDE) solvers is that the main cost often comes from generating training data with classical solvers rather than learning the model itself. Another is that there are clear axes of difficulty (e.g., more complex geometries and higher Reynolds numbers) along which problems become (1) harder for classical solvers and thus (2) more likely to benefit from neural speedups. Towards addressing this chicken-and-egg challenge, we study difficulty transfer on 2D incompressible Navier–Stokes, systematically varying task complexity along geometry (number and placement of obstacles), physics (Reynolds number), and their combination. Similar to how it is possible to spend compute to pre-train foundation models and improve their performance on downstream tasks, we find that by classically solving (analogously pre-generating) many low and medium difficulty examples and including them in the training set, it is possible to learn high-difficulty physics from far fewer samples. Furthermore, we show that by combining low and high difficulty data, we can spend 8.9x less compute on pre-generating a dataset to achieve the same error as using only high difficulty examples. Our results highlight that how we allocate classical-solver compute across difficulty levels is as important as how much we allocate overall, and suggest substantial gains from principled curation of pre-generated PDE data for neural solvers. Our code is available at https://github.com/Naman-Choudhary-AI-ML/pregenerating-pde
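
The compute-allocation argument in the abstract can be illustrated with a toy cost model. This is a minimal sketch, not the authors' accounting: the per-sample costs in `COST_PER_SAMPLE` are hypothetical placeholders in arbitrary units, and the function names are invented for illustration. It also compares datasets of equal size, whereas the 8.9x figure in the abstract is measured at equal error.

```python
# Hypothetical per-sample classical-solver costs in arbitrary units.
# These are placeholders for illustration, NOT measurements from the paper.
COST_PER_SAMPLE = {"easy": 1.0, "medium": 4.0, "hard": 20.0}


def pregeneration_cost(counts, cost_per_sample=COST_PER_SAMPLE):
    """Total solver cost of a dataset, given sample counts per difficulty."""
    return sum(n * cost_per_sample[level] for level, n in counts.items())


def savings_factor(baseline_counts, mixed_counts, cost_per_sample=COST_PER_SAMPLE):
    """Ratio of the baseline dataset's cost to the mixed dataset's cost."""
    baseline = pregeneration_cost(baseline_counts, cost_per_sample)
    mixed = pregeneration_cost(mixed_counts, cost_per_sample)
    return baseline / mixed


# An all-hard dataset versus one that keeps only 10% hard examples.
hard_only = {"hard": 800}
mixed = {"easy": 720, "hard": 80}

print("hard-only cost:", pregeneration_cost(hard_only))
print("mixed cost:", pregeneration_cost(mixed))
print("savings factor:", savings_factor(hard_only, mixed))
```

Under these assumed costs the hard examples dominate the budget even at a 10% share, which is why the allocation across difficulty levels matters as much as the total amount of compute.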

Paper Structure

This paper contains 33 sections, 13 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Top: vorticity snapshots across increasing geometry difficulty, with flows past zero, one, and multiple (2–10) square obstacles. Bottom: snapshots across physics difficulties in the form of low ($[100,1000]$), medium ($[2000,4000]$), and high ($[8000,10000]$) Re bands.
  • Figure 2: FPO with objects from the FlowBench G1 NURBS data (Tali et al., 2024).
  • Figure 3: Computational cost of simulating flow past an object (FPO) at different difficulty settings, demonstrating increasing runtime along both the domain geometry axis (increasing number of obstacles) and the physics axis (increasing Reynolds number). The costs reported are averages across thirty simulations.
  • Figure 4: Performance on hard (high Re) examples while varying the data composition. We fix the total number of training examples to 800 and show the error of various models as the fraction of the data consisting of high Re ($\in[8000,10000]$) examples increases. Here the easy and medium examples are low Re ($\in[100,1000]$) and medium Re ($\in[2000,4000]$), respectively. The top row evaluates supervised models on no-obstacle FPO (left) and lid-driven cavity (LDC) flow (right), the bottom left evaluates supervised models on flows past multiple objects, and the bottom right evaluates multiple Poseidon FMs on flows past multiple objects. Across all results we observe that a small fraction of lower difficulty examples is able to recover much of the performance of neural PDE solvers trained on solely hard (target) examples.
  • Figure 5: Performance on hard (multi-obstacle) FPO while varying data composition. The total number of training examples is fixed to 800 and we evaluate using varying fractions of zero obstacle (easy) and single obstacle (medium) simulations in the training data. As with varying Re, for both supervised models (left) and Poseidon FMs (right), a small number of lower difficulty examples suffices to recover most of the performance of models trained on entirely hard examples.
  • ...and 12 more figures
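
The fixed-budget setup described in the captions of Figures 4 and 5 (800 training examples, with a varying fraction drawn from the hard distribution) can be sketched as follows. This is an illustrative sketch only: `compose_training_set`, the `pools` dictionary, and the tuple stand-ins for simulations are hypothetical and are not taken from the authors' code.

```python
import random


def compose_training_set(pools, total=800, hard_fraction=0.1, filler="easy", seed=0):
    """Draw `total` examples: a `hard_fraction` share from the hard pool and
    the remainder from the `filler` pool ("easy" or "medium")."""
    if not 0.0 <= hard_fraction <= 1.0:
        raise ValueError("hard_fraction must lie in [0, 1]")
    n_hard = round(total * hard_fraction)
    n_filler = total - n_hard
    rng = random.Random(seed)  # seeded so the split is reproducible
    chosen = rng.sample(pools["hard"], n_hard) + rng.sample(pools[filler], n_filler)
    rng.shuffle(chosen)
    return chosen


# Hypothetical pools: each tuple stands in for one pre-generated simulation.
pools = {
    level: [(level, i) for i in range(1000)]
    for level in ("easy", "medium", "hard")
}

train = compose_training_set(pools, total=800, hard_fraction=0.1, filler="medium")
print(len(train), sum(1 for level, _ in train if level == "hard"))
```

Sweeping `hard_fraction` from 0 to 1 with `filler` set to `"easy"` or `"medium"` reproduces the kind of composition axis the figures plot, while the total number of examples stays fixed.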