Learning where to learn: Training data distribution optimization for scientific machine learning
Nicolas Guerra, Nicholas H. Nelsen, Yunan Yang
TL;DR
This work tackles distribution shift in scientific machine learning by optimizing the training data distribution ν itself so as to minimize average deployment error across a family of regimes. It develops two principled optimization approaches, a bilevel framework in a reproducing kernel Hilbert space (RKHS) and an alternating scheme based on an upper bound. Both operate over probability measures and can be implemented with parametric or nonparametric (particle-based) representations. Lipschitz-based out-of-distribution (OOD) bounds and average-case performance analyses inform the algorithms. Numerical experiments on function approximation and PDE operator learning (including EIT, Darcy flow, radiative transport, and Burgers' equation) demonstrate significant reductions in OOD error and improved sample efficiency. The work positions intelligent data acquisition as a core component of SciML workflows and provides practical methods for tailoring training data to complex deployment regimes, with extensible architectures and publicly available code for reproducibility.
Abstract
In scientific machine learning, models are routinely deployed with parameter values or boundary conditions far from those used in training. This paper studies the learning-where-to-learn problem of designing a training data distribution that minimizes average prediction error across a family of deployment regimes. A theoretical analysis shows how the training distribution shapes deployment accuracy. This motivates two adaptive algorithms based on bilevel or alternating optimization in the space of probability measures. Discretized implementations using parametric distribution classes or nonparametric particle-based gradient flows deliver optimized training distributions that outperform nonadaptive designs. Once trained, the resulting models exhibit improved sample complexity and robustness to distribution shift. This framework unlocks the potential of principled data acquisition for learning functions and solution operators of partial differential equations.
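To make the learning-where-to-learn objective concrete, the following is a minimal toy sketch and not the authors' bilevel or alternating algorithm. It assumes a one-parameter family of training distributions N(0, σ²), a hypothetical target function sin(3x), Gaussian-kernel ridge regression as the learner, and Gaussian deployment regimes with hypothetical means. It then picks σ by brute-force search over candidates rather than by gradient-based optimization over probability measures. All names and parameter values are illustrative assumptions.

```python
import numpy as np

def target(x):
    # Hypothetical ground-truth function (an assumption for illustration).
    return np.sin(3.0 * x)

def fit_krr(x_train, y_train, lengthscale=0.5, reg=1e-4):
    # Kernel ridge regression with a Gaussian kernel on 1D inputs.
    n = len(x_train)
    K = np.exp(-0.5 * (x_train[:, None] - x_train[None, :]) ** 2 / lengthscale**2)
    alpha = np.linalg.solve(K + reg * n * np.eye(n), y_train)

    def predict(x):
        Kx = np.exp(-0.5 * (x[:, None] - x_train[None, :]) ** 2 / lengthscale**2)
        return Kx @ alpha

    return predict

def average_deployment_error(sigma, deploy_means, rng, n_train=40, n_test=500):
    # Draw training inputs from the candidate training distribution N(0, sigma^2).
    x_train = rng.normal(0.0, sigma, size=n_train)
    model = fit_krr(x_train, target(x_train))
    # Average the mean squared error over the family of deployment regimes,
    # each modeled here as N(m, 0.3^2) (an assumption).
    errors = []
    for m in deploy_means:
        x_test = rng.normal(m, 0.3, size=n_test)
        errors.append(np.mean((model(x_test) - target(x_test)) ** 2))
    return float(np.mean(errors))

def select_training_sigma(candidates, deploy_means, seed=0, n_repeats=5):
    # Brute-force outer loop: score each candidate sigma, keep the best one.
    scores = []
    for sigma in candidates:
        rng = np.random.default_rng(seed)  # same random stream for every candidate
        runs = [average_deployment_error(sigma, deploy_means, rng) for _ in range(n_repeats)]
        scores.append(float(np.mean(runs)))
    best = int(np.argmin(scores))
    return candidates[best], scores

if __name__ == "__main__":
    best_sigma, scores = select_training_sigma([0.1, 0.5, 1.0, 2.0], [-2.0, 0.0, 2.0])
    print("selected sigma:", best_sigma)
    print("average deployment errors:", scores)
```

The inner step (fit a model on samples from the candidate training distribution) and the outer step (score that distribution by average deployment error) mirror the nested structure described in the abstract. The exhaustive search over σ is only a stand-in for the adaptive optimization the paper proposes.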
