Learning where to learn: Training data distribution optimization for scientific machine learning

Nicolas Guerra, Nicholas H. Nelsen, Yunan Yang

TL;DR

This work addresses distribution shift in scientific machine learning (SciML) by designing the training data distribution ν itself to minimize the average deployment error across a family of regimes. It develops two optimization approaches that operate over probability measures: a bilevel framework in a reproducing kernel Hilbert space (RKHS) and an alternating scheme based on an upper bound. Both can be implemented with parametric or nonparametric (particle-based) representations. Lipschitz-based out-of-distribution (OOD) bounds and average-case performance analyses inform the algorithms. Numerical experiments on function approximation and PDE operator learning (including electrical impedance tomography (EIT), Darcy flow, radiative transport, and Burgers' equation) show reduced OOD error and improved sample efficiency relative to nonadaptive designs. The work positions data acquisition as a core component of SciML workflows and provides practical methods to tailor training data to complex deployment regimes, with extensible architectures and publicly available code for reproducibility.

Abstract

In scientific machine learning, models are routinely deployed with parameter values or boundary conditions far from those used in training. This paper studies the learning-where-to-learn problem of designing a training data distribution that minimizes average prediction error across a family of deployment regimes. A theoretical analysis shows how the training distribution shapes deployment accuracy. This motivates two adaptive algorithms based on bilevel or alternating optimization in the space of probability measures. Discretized implementations using parametric distribution classes or nonparametric particle-based gradient flows deliver optimized training distributions that outperform nonadaptive designs. Once trained, the resulting models exhibit improved sample complexity and robustness to distribution shift. This framework unlocks the potential of principled data acquisition for learning functions and solution operators of partial differential equations.
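The idea of choosing where to sample training data can be illustrated with a toy problem. The sketch below is not one of the paper's algorithms (those use bilevel or alternating optimization over probability measures); it swaps in a plain grid search over the mean of a Gaussian training distribution. The ground truth `truth`, the linear model, the deployment centers, and all function names are assumptions made for illustration only.

```python
import random


def truth(x):
    """Hypothetical ground-truth function to be learned."""
    return x * x


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def average_deployment_error(train_mean, test_centers,
                             n_train=200, n_test=500, seed=0):
    """Train on N(train_mean, 1) and average the squared error over
    deployment regimes N(c, 0.5^2), one regime per center c."""
    rng = random.Random(seed)
    xs = [rng.gauss(train_mean, 1.0) for _ in range(n_train)]
    intercept, slope = fit_line(xs, [truth(x) for x in xs])
    total = 0.0
    for center in test_centers:
        test = [rng.gauss(center, 0.5) for _ in range(n_test)]
        total += sum((intercept + slope * x - truth(x)) ** 2
                     for x in test) / n_test
    return total / len(test_centers)


# Assumed family of deployment regimes and candidate training means.
test_centers = [2.0, 3.0, 4.0]
candidates = [float(m) for m in range(-2, 7)]

# Grid search: pick the training mean with the lowest average error.
errors = {m: average_deployment_error(m, test_centers) for m in candidates}
best_mean = min(errors, key=errors.get)
print("best training mean:", best_mean)
print("average deployment error:", errors[best_mean])
```

In this toy setting the selected training mean sits near the middle of the deployment regimes, while training far from them leaves the linear model extrapolating poorly. A gradient-based or particle-based update of the training distribution would replace the grid search in a realistic setting.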

Paper Structure

This paper contains 67 sections, 15 theorems, 105 equations, 18 figures, 3 tables, 3 algorithms.

Key Result

Proposition 3.1

Let $\mathcal{G}_1$ and $\mathcal{G}_2$ both be Lipschitz continuous maps from Hilbert space $\mathcal{U}$ to Hilbert space $\mathcal{Y}$. For any $\nu\in\mathscr{P}_1(\mathcal{U})$ and $\nu'\in\mathscr{P}_1(\mathcal{U})$, it holds that Moreover, for any $\mu\in\mathscr{P}_2(\mathcal{U})$ and $\mu'\in\mathscr{P}_2(\mathcal{U})$, it holds that where $c(\mathcal{G}_1,\mathcal{G}_2,\mu,\mu')$ equal
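A bound of this type can be checked numerically. The sketch below tests the standard Lipschitz shift inequality $\mathbb{E}_{u\sim\nu'}|\mathcal{G}_1(u)-\mathcal{G}_2(u)| \le \mathbb{E}_{u\sim\nu}|\mathcal{G}_1(u)-\mathcal{G}_2(u)| + (\mathrm{Lip}(\mathcal{G}_1)+\mathrm{Lip}(\mathcal{G}_2))\,\mathsf{W}_1(\nu,\nu')$, which follows from Kantorovich-Rubinstein duality. It is offered as an illustration only and may differ from the exact statement and constants of Proposition 3.1. The scalar maps `g1` and `g2`, the Gaussian samples, and the helper names are assumptions.

```python
import math
import random


def w1_empirical_1d(xs, ys):
    """Exact 1-Wasserstein distance between two equal-size empirical
    measures on the real line, computed by matching sorted samples."""
    if len(xs) != len(ys):
        raise ValueError("sample sizes must match")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


def g1(u):
    """Hypothetical map with Lipschitz constant 1."""
    return math.sin(u)


def g2(u):
    """Hypothetical map with Lipschitz constant 0.5."""
    return 0.5 * u


def mean_gap(samples):
    """Empirical mean of |g1(u) - g2(u)| over the samples."""
    return sum(abs(g1(u) - g2(u)) for u in samples) / len(samples)


rng = random.Random(0)
nu = [rng.gauss(0.0, 1.0) for _ in range(1000)]        # training samples
nu_prime = [rng.gauss(2.0, 1.5) for _ in range(1000)]  # shifted samples

lhs = mean_gap(nu_prime)
rhs = mean_gap(nu) + (1.0 + 0.5) * w1_empirical_1d(nu, nu_prime)
print("shifted error:", lhs)
print("upper bound:  ", rhs)
```

Because the inequality holds for any pair of probability measures, it holds exactly for the two empirical measures used here, so the printed bound always dominates the shifted error.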

Figures (18)

  • Figure 1: Two conductivity samples in EIT. The true conductivity is on the left of each panel, followed by predictions from NIO models trained on four Dirichlet boundary condition distributions $\{\nu_i\}_{i=1}^4$. Top row: in-distribution predictions; bottom row: out-of-distribution predictions. See the supplementary material (SM) for architecture and numerical details.
  • Figure 2: The bilevel algorithm applied to ground truth $g_1\colon\mathbb{R}^2\to\mathbb{R}$. (Left) Evolution of $\mathsf{Err}$ over $1000$ iterations ($N=250$ samples per step) of gradient descent. (Center) $\mathsf{Err}$ of model trained on $N$ samples from optimized $\nu_\vartheta$ (ours) vs. initial normal $\mathcal{N}(m_0, I_2)$, empirical $\mathbb{Q}$ $\mathsf{W}_2$-barycenter, empirical $\mathbb{Q}$ mixture, $\mathrm{Unif}([0,1]^2)$, and two pool-based coresets. (Right) Same as center, except incorporating the additional function evaluation cost incurred by the bilevel algorithm. Shading represents two standard deviations away from the mean $\mathsf{Err}$ over $10$ independent runs.
  • Figure 3: (Left) Matrix approximation of NtD map. (Center) Decay of the relative OOD error, averaged over $80$ independent runs, of the model trained on the optimal distribution identified at each iteration of the alternating algorithm; a $95\%$ confidence interval of the true relative OOD error is provided at each iteration. (Right) Decay of the average AMA loss (the AMA objective) vs. iteration, relative to the same loss at initialization.
  • Figure 4: The first two images show the test conductivity $a$ and true PDE solution $u$. The next four images show the absolute error $|u_\text{pred}-u|$ of the model trained on the training distribution from iterations 1, 2, 3, and 8.
  • Figure 5: Relative OOD error vs. sample size $N$ for learning the radiative transport solution operator. Results are shown for Knudsen numbers $\varepsilon \in \{1/8, 2, 8\}$. For each $N$, each panel displays 95% confidence intervals over 10 trials for three DeepONet models trained for 5000 epochs: the initial model, the model after particle-based AMA, and a benchmark trained on the test distribution mixture $\nu_\mathbb{Q}\coloneqq \frac{1}{3}\sum_{k=1}^3\nu_k'$.
  • ...and 13 more figures

Theorems & Definitions (45)

  • Definition 2.1: $p$-Wasserstein distance
  • Definition 2.2: random measure
  • Proposition 3.1: distribution shift error
  • Corollary 3.2: basic inequality
  • Remark 4.1: finite data
  • Lemma 4.2: adjoint state equation
  • Proposition 4.3: derivative: infinite-dimensional case
  • Remark 4.4: Wasserstein gradient flow
  • Theorem 4.5: gradient: parametric case
  • Example 4.6: Gaussian parametrization
  • ...and 35 more
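
For reference, the $p$-Wasserstein distance named in Definition 2.1 is usually defined as follows. This is the standard textbook form, written in notation assumed to match the paper ($\mathscr{P}_p(\mathcal{U})$ for probability measures on $\mathcal{U}$ with finite $p$-th moment, $\Gamma(\mu,\nu)$ for the set of couplings of $\mu$ and $\nu$); the paper's own statement may differ in presentation.

```latex
\mathsf{W}_p(\mu, \nu) \coloneqq
\Bigl( \inf_{\gamma \in \Gamma(\mu, \nu)}
\int_{\mathcal{U} \times \mathcal{U}} \| u - v \|_{\mathcal{U}}^p \, \gamma(du, dv)
\Bigr)^{1/p},
\qquad \mu, \nu \in \mathscr{P}_p(\mathcal{U}), \quad p \in [1, \infty).
```

The cases $p=1$ and $p=2$ are the ones that appear in Proposition 3.1 through $\mathscr{P}_1(\mathcal{U})$ and $\mathscr{P}_2(\mathcal{U})$.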