Pre-Generating Multi-Difficulty PDE Data for Few-Shot Neural PDE Solvers
Naman Choudhary, Vedant Singh, Ameet Talwalkar, Nicholas Matthew Boffi, Mikhail Khodak, Tanya Marwah
TL;DR
This work investigates how the difficulty composition of training data affects the performance of neural PDE solvers. By systematically varying geometry and Reynolds-number difficulty in 2D incompressible Navier–Stokes simulations and pre-generating datasets at easy, medium, and hard levels, the authors show that supplementing a small amount of hard data with lower-difficulty data significantly boosts performance on hard distributions, enabling substantial compute savings. Across supervised operators (CNO, F-FNO) and large pretrained models (Poseidon variants), mixing about 10% hard examples with easy/medium data recovers most of the hard-data performance, with medium-difficulty data often offering the best cost-to-performance tradeoff. They also demonstrate the potential of foundation-like datasets to amortize pre-generation costs across multiple downstream tasks, motivating data-centric strategies for scalable neural PDE solvers.
Abstract
A key aspect of learned partial differential equation (PDE) solvers is that the main cost often comes from generating training data with classical solvers rather than from training the model itself. Another is that there are clear axes of difficulty (e.g., more complex geometries and higher Reynolds numbers) along which problems become (1) harder for classical solvers and thus (2) more likely to benefit from neural speedups. To address this chicken-and-egg challenge, we study difficulty transfer on 2D incompressible Navier–Stokes, systematically varying task complexity along geometry (number and placement of obstacles), physics (Reynolds number), and their combination. Just as compute can be spent pre-training foundation models to improve their performance on downstream tasks, we find that by classically solving (analogously, pre-generating) many low- and medium-difficulty examples and including them in the training set, it is possible to learn high-difficulty physics from far fewer samples. Furthermore, we show that by combining low- and high-difficulty data, we can spend 8.9x less compute on pre-generating a dataset to achieve the same error as using only high-difficulty examples. Our results highlight that how we allocate classical-solver compute across difficulty levels is as important as how much we allocate overall, and suggest substantial gains from principled curation of pre-generated PDE data for neural solvers. Our code is available at https://github.com/Naman-Choudhary-AI-ML/pregenerating-pde
