Optimization for Neural Operators can Benefit from Width
Pedro Cisneros-Velarde, Bhavesh Shrimali, Arindam Banerjee
TL;DR
This work addresses the lack of optimization guarantees for neural operators by developing a unified gradient-descent framework built on two key conditions: $\alpha_t$-restricted strong convexity (RSC) and $\beta$-smoothness. It shows that the losses of both Deep Operator Networks (DONs) and Fourier Neural Operators (FNOs) satisfy these conditions when the networks are sufficiently wide, despite their architectural differences, and it provides probabilistic guarantees for loss decrease under gradient descent. The main theoretical contributions are explicit expressions for the RSC parameter $\alpha_t$ and the smoothness constant $\beta$ in terms of network width, depth, and initialization, along with a benefit of over-parameterization: a larger neighborhood in which convergence is guaranteed. Complementing the theory, the authors demonstrate empirically that increasing width reduces training loss and speeds convergence on canonical operator-learning tasks (antiderivative, diffusion-reaction, and Burgers’ equation), highlighting the practical significance of width in neural operator optimization.
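As a sketch of why restricted strong convexity and smoothness together force the loss to decrease, the standard argument can be written out as follows. The notation here is assumed for illustration and is not taken from the paper: $\mathcal{L}$ is the training loss, $\theta_t$ the parameters at step $t$, $Q_t$ the restricted set on which RSC holds with $\alpha_t > 0$, and the step size is fixed to $1/\beta$. The paper's exact sets, step size, and constants may differ.

$$\mathcal{L}(\theta_{t+1}) \le \mathcal{L}(\theta_t) - \frac{1}{2\beta}\,\|\nabla \mathcal{L}(\theta_t)\|_2^2, \qquad \theta_{t+1} = \theta_t - \tfrac{1}{\beta}\nabla \mathcal{L}(\theta_t) \quad (\beta\text{-smoothness})$$

$$\mathcal{L}(\theta') \ge \mathcal{L}(\theta_t) + \langle \nabla \mathcal{L}(\theta_t),\, \theta' - \theta_t \rangle + \frac{\alpha_t}{2}\,\|\theta' - \theta_t\|_2^2 \ge \mathcal{L}(\theta_t) - \frac{1}{2\alpha_t}\,\|\nabla \mathcal{L}(\theta_t)\|_2^2, \qquad \theta' \in Q_t \quad (\text{RSC})$$

$$\mathcal{L}(\theta_{t+1}) - \mathcal{L}(\theta') \le \left(1 - \frac{\alpha_t}{\beta}\right)\bigl(\mathcal{L}(\theta_t) - \mathcal{L}(\theta')\bigr)$$

The second line rearranges to $\|\nabla \mathcal{L}(\theta_t)\|_2^2 \ge 2\alpha_t\,(\mathcal{L}(\theta_t) - \mathcal{L}(\theta'))$, and substituting this into the first line gives the third. In this sketch, a larger $\alpha_t$ relative to $\beta$ means a faster geometric decrease of the loss gap.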
Abstract
Neural Operators that directly learn mappings between function spaces, such as Deep Operator Networks (DONs) and Fourier Neural Operators (FNOs), have received considerable attention. Despite the universal approximation guarantees for DONs and FNOs, there is currently no optimization convergence guarantee for training such networks with gradient descent (GD). In this paper, we address this open problem by presenting a unified framework for optimization based on GD and applying it to establish convergence guarantees for both DONs and FNOs. In particular, we show that the losses associated with both of these neural operators satisfy two conditions, restricted strong convexity (RSC) and smoothness, that guarantee a decrease in their loss values under GD. Remarkably, each neural operator satisfies these two conditions for different reasons, which stem from the architectural differences between the models. One takeaway that emerges from the theory is that wider networks should lead to better optimization convergence for both DONs and FNOs. We present empirical results on canonical operator learning problems to support our theoretical results.
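To make the role of the two conditions concrete, here is a minimal numerical sketch in the simplest possible setting: a quadratic loss, where strong convexity and smoothness hold globally rather than on a restricted set. It is not a neural operator and does not reproduce the paper's experiments; the matrix `A`, the dimensions, and the step size `1 / beta` are assumptions chosen for illustration. The check is that each GD step shrinks the loss gap by at least the factor `1 - alpha / beta`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic loss L(x) = 0.5 * x^T A x with A symmetric positive definite.
# The smallest and largest eigenvalues of A play the roles of alpha and beta.
M = rng.standard_normal((20, 5))
A = M.T @ M + 0.1 * np.eye(5)
eigvals = np.linalg.eigvalsh(A)  # eigenvalues in ascending order
alpha, beta = eigvals[0], eigvals[-1]


def loss(x):
    return 0.5 * x @ A @ x


def grad(x):
    return A @ x


x = rng.standard_normal(5)
gaps = [loss(x)]  # the minimum loss is 0 (at x = 0), so the loss is the gap
for _ in range(50):
    x = x - grad(x) / beta  # gradient descent with step size 1 / beta
    gaps.append(loss(x))

# Check the per-step contraction: gap_{t+1} <= (1 - alpha / beta) * gap_t.
rate = 1.0 - alpha / beta
contracts = all(g_next <= rate * g + 1e-12 for g, g_next in zip(gaps, gaps[1:]))
print("contraction factor:", rate)
print("every step contracts by at least this factor:", contracts)
```

In this toy setting a larger ratio `alpha / beta` gives a smaller contraction factor and hence faster convergence, which is the mechanism through which the theory connects width to optimization: the paper's bounds on the RSC and smoothness parameters depend on width.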
