Learning Operators by Regularized Stochastic Gradient Descent with Operator-valued Kernels
Jia-Qi Yang, Lei Shi
TL;DR
The paper develops a theory for learning regression operators that map a Polish input space to a separable Hilbert space, using vector-valued RKHSs induced by operator-valued kernels. The authors analyze regularized SGD in infinite-dimensional spaces and recast the nonlinear operator regression as a linear one via a Hilbert–Schmidt map, which yields dimension-free, near-optimal convergence rates in both the online setting (decaying $\eta_t$ and $\lambda_t$) and the finite-horizon setting (constant parameters). The error analysis decomposes the error into approximation, initialization, drift, and sampling components and establishes bounds both in expectation and with high probability, with explicit rates depending on the regularity $r$ and the capacity $s$. The results give probabilistic guarantees for regularized operator learning in infinite dimensions and point to extensions to general kernels, structured prediction, and PCA-based encoder–decoder frameworks, with implications for real-time, discretization-invariant learning of solution operators for parameterized PDEs and related tasks.
Abstract
We consider a class of statistical inverse problems involving the estimation of a regression operator from a Polish space to a separable Hilbert space, where the target lies in a vector-valued reproducing kernel Hilbert space induced by an operator-valued kernel. To address the associated ill-posedness, we analyze regularized stochastic gradient descent (SGD) algorithms in both online and finite-horizon settings. The former uses polynomially decaying step sizes and regularization parameters, while the latter adopts fixed values. Under suitable structural and distributional assumptions, we establish dimension-independent bounds for prediction and estimation errors. The resulting convergence rates are near-optimal in expectation, and we also derive high-probability estimates that imply almost sure convergence. Our analysis introduces a general technique for obtaining high-probability guarantees in infinite-dimensional settings. Possible extensions to broader kernel classes and encoder-decoder structures are briefly discussed.
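To make the algorithm concrete, here is a minimal sketch of regularized SGD in a vector-valued RKHS. It rests on several assumptions that are not taken from the paper: the update is the commonly used form $f_{t+1} = f_t - \eta_t\,[K_{x_t}(f_t(x_t) - y_t) + \lambda_t f_t]$; the operator-valued kernel is the separable choice $K(x, x') = k(x, x')\,I$ with a Gaussian scalar kernel $k$; the Hilbert output space is replaced by a finite-dimensional stand-in $\mathbb{R}^m$; and the schedules are $\eta_t = \eta_1 t^{-\theta_\eta}$ and $\lambda_t = \lambda_1 t^{-\theta_\lambda}$. The function names, default constants, and exponents are illustrative only.

```python
import numpy as np


def gaussian_kernel(x, z, bandwidth=1.0):
    """Scalar Gaussian kernel k(x, z); an illustrative choice, not the paper's."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * bandwidth ** 2)))


def regularized_sgd(X, Y, eta1=0.5, lam1=0.01, theta_eta=0.5, theta_lam=0.25,
                    bandwidth=1.0):
    """One pass of regularized SGD with the separable kernel K = k * I.

    The iterate is stored as f_t(x) = sum_i k(x_i, x) * coeffs[i], where each
    coeffs[i] lives in R^m (a stand-in for the Hilbert output space).
    Setting theta_eta = theta_lam = 0 gives constant parameters.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, out_dim = Y.shape
    coeffs = np.zeros((n, out_dim))
    for t in range(n):
        # Hypothetical polynomial schedules (constants are assumptions).
        eta_t = eta1 * (t + 1) ** (-theta_eta)
        lam_t = lam1 * (t + 1) ** (-theta_lam)
        # Evaluate the current iterate f_t at the new sample x_t.
        if t > 0:
            k_vec = np.array([gaussian_kernel(X[i], X[t], bandwidth)
                              for i in range(t)])
            pred = k_vec @ coeffs[:t]
        else:
            pred = np.zeros(out_dim)
        # Regularization term: shrink f_t by the factor (1 - eta_t * lam_t).
        coeffs[:t] *= 1.0 - eta_t * lam_t
        # Gradient term: add -eta_t * k(x_t, .) * (f_t(x_t) - y_t).
        coeffs[t] = -eta_t * (pred - Y[t])
    return coeffs


def predict(x, X, coeffs, bandwidth=1.0):
    """Evaluate the learned function at a new input x."""
    k_vec = np.array([gaussian_kernel(xi, x, bandwidth) for xi in X])
    return k_vec @ coeffs
```

Calling `regularized_sgd(X, Y)` with `X` of shape `(n, d)` and `Y` of shape `(n, m)` runs the decaying-parameter variant, while passing `theta_eta=0.0, theta_lam=0.0` mimics the fixed-parameter, finite-horizon variant. Storing one coefficient per sample keeps the sketch simple, at a cost that grows quadratically with the number of samples.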
