Table of Contents
Fetching ...

Bilevel optimization for learning hyperparameters: Application to solving PDEs and inverse problems with Gaussian processes

Nicholas H. Nelsen, Houman Owhadi, Andrew M. Stuart, Xianjin Yang, Zongren Zou

TL;DR

This work develops a bilevel framework for learning kernel and model hyperparameters in GP-based PDE and inverse problem solvers. By replacing full inner solves with a single Gauss--Newton linearization, it reduces each outer iteration to a linear state solve plus explicit hyperparameter updates, enabling scalable hypergradient computation. The authors demonstrate substantial gains in accuracy and robustness across nonlinear elliptic PDEs, Schrödinger equations, reaction-diffusion systems, Eikonal and Burgers’ equations, and Darcy inverse problems, including high-dimensional hyperparameters and deep kernels. The approach supports both Optimize-Then-Discretize and Discretize-Then-Optimize variants and highlights the practical impact of learned hyperparameters on generalization and stability in physics-informed learning.

Abstract

Methods for solving scientific computing and inference problems, such as kernel- and neural network-based approaches for partial differential equations (PDEs), inverse problems, and supervised learning tasks, depend crucially on the choice of hyperparameters. Specifically, the efficacy of such methods, and in particular their accuracy, stability, and generalization properties, strongly depends on the choice of hyperparameters. While bilevel optimization offers a principled framework for hyperparameter tuning, its nested optimization structure can be computationally demanding, especially in PDE-constrained contexts. In this paper, we propose an efficient strategy for hyperparameter optimization within the bilevel framework by employing a Gauss-Newton linearization of the inner optimization step. Our approach provides closed-form updates, eliminating the need for repeated costly PDE solves. As a result, each iteration of the outer loop reduces to a single linearized PDE solve, followed by explicit gradient-based hyperparameter updates. We demonstrate the effectiveness of the proposed method through Gaussian process models applied to nonlinear PDEs and to PDE inverse problems. Extensive numerical experiments highlight substantial improvements in accuracy and robustness compared to conventional random hyperparameter initialization. In particular, experiments with additive kernels and neural network-parameterized deep kernels demonstrate the method's scalability and effectiveness for high-dimensional hyperparameter optimization.

Bilevel optimization for learning hyperparameters: Application to solving PDEs and inverse problems with Gaussian processes

TL;DR

This work develops a bilevel framework for learning kernel and model hyperparameters in GP-based PDE and inverse problem solvers. By replacing full inner solves with a single Gauss--Newton linearization, it reduces each outer iteration to a linear state solve plus explicit hyperparameter updates, enabling scalable hypergradient computation. The authors demonstrate substantial gains in accuracy and robustness across nonlinear elliptic PDEs, Schrödinger equations, reaction-diffusion systems, Eikonal and Burgers’ equations, and Darcy inverse problems, including high-dimensional hyperparameters and deep kernels. The approach supports both Optimize-Then-Discretize and Discretize-Then-Optimize variants and highlights the practical impact of learned hyperparameters on generalization and stability in physics-informed learning.

Abstract

Methods for solving scientific computing and inference problems, such as kernel- and neural network-based approaches for partial differential equations (PDEs), inverse problems, and supervised learning tasks, depend crucially on the choice of hyperparameters. Specifically, the efficacy of such methods, and in particular their accuracy, stability, and generalization properties, strongly depends on the choice of hyperparameters. While bilevel optimization offers a principled framework for hyperparameter tuning, its nested optimization structure can be computationally demanding, especially in PDE-constrained contexts. In this paper, we propose an efficient strategy for hyperparameter optimization within the bilevel framework by employing a Gauss-Newton linearization of the inner optimization step. Our approach provides closed-form updates, eliminating the need for repeated costly PDE solves. As a result, each iteration of the outer loop reduces to a single linearized PDE solve, followed by explicit gradient-based hyperparameter updates. We demonstrate the effectiveness of the proposed method through Gaussian process models applied to nonlinear PDEs and to PDE inverse problems. Extensive numerical experiments highlight substantial improvements in accuracy and robustness compared to conventional random hyperparameter initialization. In particular, experiments with additive kernels and neural network-parameterized deep kernels demonstrate the method's scalability and effectiveness for high-dimensional hyperparameter optimization.

Paper Structure

This paper contains 26 sections, 95 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Left: $L^2$ errors of GP solutions chen2021solving to \ref{['eq:nonlinear_elliptic']} versus kernel lengthscale $\theta$ and training size $M$. Right: zoom for $\theta\in[0.05,0.35]$. Accuracy improves with $M$ but is sensitive to $\theta$; when $\theta$ lies far outside the low-error region, additional increases in $M$ yield minimal gains.
  • Figure 2: Bilevel hyperparameter learning. At iteration $k$, given $(u^k,\theta^k)$ we (i) define the Gauss--Newton map $\Pi_k(\theta)$ by solving the linearized training constraint $R_{\mathrm{train}}(u^k)+D_u R_{\mathrm{train}}(u^k)(u-u^k)=0$ with penalty $\|u\|_{\mathcal{U}_\theta}^2$; (ii) minimize the outer objective $\mathcal{J}_k(\theta)\coloneqq\tfrac{1}{2}\|R_{\mathrm{val}}(\Pi_k(\theta))\|^2+\mathcal{R}(\theta)$ to obtain $\theta^{k+1}\approx\arg\min_\theta \mathcal{J}_k(\theta)$ (e.g., Adam); and (iii) update the state via $u^{k+1}=\Pi_k(\theta^{k+1})$. The hypergradient $\nabla_\theta \mathcal{J}_k$ is computed by differentiating through $\Pi_k$. Here $D_u R_{\mathrm{train}}$ denotes the Fréchet derivative with respect to $u$.
  • Figure 3: (Left) Evolution of the learned lengthscale in the Gaussian kernel for solving the nonlinear elliptic equation \ref{['eq:elliptic']}, initialized from various starting values $l_0$. All trajectories converge to similar final values, indicating robustness of the learning procedure. (Right) Visualization of the landscape of the linearized outer DTO loss versus lengthscale $\theta=l$ of \ref{['eq:rbf:kernel']}. As the outer iteration index $k$ increases, the minimum value of the loss begins to stabilize near $l=0.2$, which is represented by the vertical gray dashed line.
  • Figure 4: Solving the complex-valued nonlinear Schrödinger equation with the GP-PDE method using the learned hyperparameters. The white and yellow "x" in the top figure represent the locations of the collocation and boundary points, respectively.
  • Figure 5: Solving the Gray--Scott equation using the GP-PDE method with learned hyperparameters. The white and yellow "x" markers indicate collocation and boundary point locations, respectively.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Remark 1.1
  • Remark 2.1: Extension to $K$-Fold Cross-Validation