Optimization for Neural Operators can Benefit from Width

Pedro Cisneros-Velarde, Bhavesh Shrimali, Arindam Banerjee

TL;DR

This work addresses the lack of optimization guarantees for neural operators by developing a unified gradient-descent framework built on two key conditions: $\alpha_t$-restricted strong convexity and $\beta$-smoothness. It shows that both Deep Operator Networks (DONs) and Fourier Neural Operators (FNOs) satisfy these conditions when the networks are sufficiently wide, despite their architectural differences, and it provides probabilistic guarantees for loss decrease under gradient descent. The main theoretical contributions are explicit expressions for the RSC parameter $\alpha_t$ and the smoothness constant $\beta$ that depend on network width, depth, and initialization, along with over-parameterization benefits that enlarge the convergence neighborhood. Complementing the theory, the authors demonstrate empirically that increasing width reduces training loss and speeds convergence on canonical operator-learning tasks (antiderivative, diffusion-reaction, and Burgers’ equation), highlighting the practical significance of width in neural operator optimization.

Abstract

Neural operators that directly learn mappings between function spaces, such as Deep Operator Networks (DONs) and Fourier Neural Operators (FNOs), have received considerable attention. Despite the universal approximation guarantees for DONs and FNOs, there is currently no optimization convergence guarantee for learning such networks using gradient descent (GD). In this paper, we address this open problem by presenting a unified framework for optimization based on GD and applying it to establish convergence guarantees for both DONs and FNOs. In particular, we show that the losses associated with both of these neural operators satisfy two conditions -- restricted strong convexity (RSC) and smoothness -- that guarantee a decrease in their loss values under GD. Remarkably, these two conditions are satisfied by each neural operator for different reasons, which stem from the architectural differences between the respective models. One takeaway that emerges from the theory is that wider networks should lead to better optimization convergence for both DONs and FNOs. We present empirical results on canonical operator learning problems to support our theoretical results.
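
As a hedged illustration of why smoothness alone already yields a loss decrease under GD, the following is the standard descent-lemma calculation. It is a textbook sketch, not the paper's exact statement: it assumes the loss ${\mathcal{L}}$ is $\beta$-smooth along the whole GD step and uses a generic step size $\eta \in (0, 2/\beta)$. The RSC condition is what the paper adds to turn this per-step decrease into a convergence guarantee.

$$
{\mathcal{L}}({\bm{\theta}}') \leq {\mathcal{L}}({\bm{\theta}}) + \langle \nabla {\mathcal{L}}({\bm{\theta}}), {\bm{\theta}}' - {\bm{\theta}} \rangle + \frac{\beta}{2} \|{\bm{\theta}}' - {\bm{\theta}}\|_2^2
\quad \text{(smoothness)}
$$

$$
{\bm{\theta}}' = {\bm{\theta}} - \eta \nabla {\mathcal{L}}({\bm{\theta}})
\;\Longrightarrow\;
{\mathcal{L}}({\bm{\theta}}') \leq {\mathcal{L}}({\bm{\theta}}) - \eta \left(1 - \frac{\beta \eta}{2}\right) \|\nabla {\mathcal{L}}({\bm{\theta}})\|_2^2
$$

The coefficient $\eta\,(1 - \beta\eta/2)$ is strictly positive exactly when $0 < \eta < 2/\beta$, so the loss strictly decreases whenever the gradient is nonzero.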

Paper Structure

This paper contains 34 sections, 22 theorems, 153 equations, 10 figures, 1 table.

Key Result

Theorem 1

Consider Assumption asmp:iter-0 and Conditions cond:rsc and cond:smooth with $\alpha_t \leq \beta$ at step $t$ of the GD update eq:gd_at_t with step-size $\eta_t=\frac{\omega_t}{\beta}$ for some $\omega_t \in(0,2)$. If ${\mathcal{L}}({\bm{\theta}}_t)\neq \underset{{\bm{\theta}} \in {\cal B}({\bm{\th
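
The step-size rule $\eta_t=\frac{\omega_t}{\beta}$ with $\omega_t \in (0,2)$ can be tried out on a toy problem. The sketch below is not the paper's setting: it assumes a strongly convex quadratic loss (so the smoothness constant $\beta$ is simply the largest eigenvalue of the Hessian), and the function name `gd_with_smoothness_step` and all problem data are hypothetical choices made for illustration.

```python
import numpy as np

def gd_with_smoothness_step(A, b, theta0, omega=1.0, steps=50):
    """Run GD on the toy loss L(theta) = 0.5 * theta^T A theta - b^T theta.

    Assumption: A is symmetric positive definite, so the loss is beta-smooth
    with beta equal to the largest eigenvalue of A.
    """
    beta = np.linalg.eigvalsh(A)[-1]  # eigenvalues are returned in ascending order
    eta = omega / beta                # step size eta = omega / beta, omega in (0, 2)

    def loss(th):
        return 0.5 * th @ A @ th - b @ th

    theta = theta0.copy()
    losses = [loss(theta)]
    for _ in range(steps):
        grad = A @ theta - b          # exact gradient of the quadratic
        theta = theta - eta * grad    # plain GD update
        losses.append(loss(theta))
    return theta, np.array(losses), beta

# Hypothetical toy data: a random symmetric positive definite Hessian.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 0.5 * np.eye(5)
b = rng.standard_normal(5)

theta, losses, beta = gd_with_smoothness_step(A, b, np.zeros(5), omega=1.5)
print("loss decreased at every step:", bool(np.all(np.diff(losses) <= 1e-12)))
```

Any `omega` in $(0,2)$ keeps the recorded losses non-increasing on this toy problem; values outside that range are not covered by the step-size rule and may diverge.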

Figures (10)

  • Figure 1: Training progress of DONs as measured by the empirical loss \ref{eq:empirical_risk} over 80,000 epochs. The $y$-axis is plotted on a log-scale and the $x$-axis denotes the training epochs in units of 100 (i.e., the loss is stored at every 100th epoch). Wider networks typically lead to lower loss for all three problems.
  • Figure 2: Training progress of FNOs as measured by the empirical loss \ref{eq:fno_loss} over 80,000 epochs. The setting of the plots is similar to Figure 1. Wider networks typically lead to lower loss for all three problems.
  • Figure 3: A schematic of the DON architecture of Lu et al. [lu20201DeepONet] used in our study. We refer to the notation used in our paper. Note that the input functions need not be sampled on a structured grid of points.
  • Figure 4: A schematic of the FNO architecture of Li et al. [li_fourier_2021] used in our study. We refer to the notation used in our paper. "Spectral convolution" and "bypass convolution" are terms used in the FNO literature to denote the effect of the linear mappings in the spectral and spatial domain, respectively.
  • Figure 5: Sample solutions obtained for the Antiderivative operator for DONs for $m\in \{10, 50, 500\}$ at the end of the training process (80,000 epochs) for a randomly chosen input function. The "data" refers to the ground truth (obtained by a standard numerical solver) and "pred" corresponds to the learned operator.
  • ...and 5 more figures

Theorems & Definitions (38)

  • Definition 1: Restricted strong convexity (RSC)
  • Theorem 1: Global loss reduction
  • Remark 1: The RSC to smoothness ratio
  • Definition 2: $Q^{t}_{\kappa}$ sets for DONs
  • Theorem 2: RSC for DONs
  • Theorem 3: Smoothness for DONs
  • Remark 2: Ensuring that $\alpha_t/\beta<1$
  • Remark 3: The benefit of over-parameterization for the RSC property
  • Remark 4: Over-parameterization allows for a larger neighborhood around initialization
  • Definition 3: $Q^{t}_{\kappa}$ sets for FNOs
  • ...and 28 more